hobokenchicken 191b7ad9b5 fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed
- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
2026-06-04 15:14:26 -04:00

# Kira Body Double — Systematic Fix Plan (from AUDIT.md)
**Created:** 2026-06-04
**Source:** Full audit of codebase, running deployment (172.20.1.204), server logs, build outputs, component cross-references, and user-reported symptoms (33s+ delays, static, "Could not transcribe", console errors).
**Goal:** Turn the current non-functional voice pipeline + incomplete UI into a reliable, low-latency body-double experience matching the original specs (realtime voice, lo-fi + white noise, ADHD tools, avatar customization, notes, clock, pet, backgrounds).
**Approach:** Prioritize by dependency and impact. Fix core voice first (STT + audio playback + feedback). Then UI completeness. Then cleanup + polish. Deploy + verify after every change where possible. Re-test end-to-end on the live site at the end.
**Principles:**
- Prefer minimal, targeted patches over big refactors.
- Keep the cheap 3-step pipeline (or Realtime where it actually helps latency).
- Verify with real tool output (builds, docker logs, git, SSH health checks).
- Update docs and todo as we go.
- If a subtask blocks, note it and move to parallelizable items.
**Total items:** 9 (matching the current todo list). Dependencies noted.
---
## 1. fix-realtime-stt (CRITICAL — blocks everything voice-related)
**Status in audit:** Mic does nothing. The WS connects, then fails with 4004 / `model_not_found`.
**Root cause:** The WS URL uses the wrong base model (`gpt-4o-mini-realtime-preview`). Per OpenAI announcements, `gpt-realtime-whisper` exists, but it must be passed as the `?model=` query parameter when opening the Realtime WS connection in transcription mode. The session config may also need adjusting for pure STT (no LLM responses from the realtime model).
**Options:**
A. Correct model + config for Realtime WS (preferred for true streaming partials + VAD).
B. Revert to proven REST path (`gpt-4o-transcribe`) + simulate deltas if possible (cheaper, known working from history).
**Plan steps:**
1. Read current `backend/services/whisper_stream.py` and `backend/main.py` WS handling.
2. Update WS URL to `?model=gpt-realtime-whisper` (and adjust session.update if needed — remove or set transcription model correctly).
3. Add better error logging + client-facing error on WS failure.
4. Rebuild/deploy.
5. Verify: SSH docker logs for "Whisper stream ready" + no 4004/model_not_found. Simulate or note user test.
6. If still fails after 1-2 attempts: implement Option B (switch main.py back to REST transcription + keep the streaming delta plumbing for future).
**Risks:** OpenAI access tier for Realtime Whisper. Cost (Realtime is more expensive than REST transcription).
**Success criteria:** Backend accepts audio chunks and emits `transcript_delta` + `on_done` / `transcript` without errors. No more "Could not transcribe".
**Dependencies:** None (first).
## 2. streaming-audio-playback (HIGH — fixes perceived latency)
**Status:** Chunks stream from backend but frontend only plays on `speaking_end`.
**Plan steps:**
1. Read `frontend/src/hooks/useConversation.ts` audio handling + `backend/main.py` synthesize_speech.
2. Switch frontend to play chunks incrementally:
- Option: Use a single Audio element backed by MediaSource / SourceBuffer (usually attached through `src` with an object URL; `srcObject` support for MediaSource varies by browser) for Opus chunks (complex but true streaming).
- Simpler reliable: Create new `<audio>` elements or use Web Audio API to decode/play each small chunk as it arrives (low latency for first words).
- Keep lip-sync tied to `isKiraSpeaking`.
3. Update backend to send smaller chunks or keep current streaming.
4. Handle `speaking_end` only for final cleanup / state.
5. Test build + deploy.
6. Verify with logs + manual timing (first audio chunk arrives ~1-2s after LLM).
**Success criteria:** The user hears Kira's voice within ~1-2s of finishing her own turn (first words audible while the rest is still generating).
**Dependencies:** 1 (need working STT to test TTS flow).
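For the Web Audio variant of step 2, the main bookkeeping is deciding when each decoded chunk should start so that chunks play back to back without gaps or overlap. The sketch below covers only that timing logic. `ChunkScheduler` and its `leadTime` parameter are hypothetical names, not existing code in `useConversation.ts`, and it assumes each chunk can be decoded on its own into a buffer with a known duration.

```typescript
// Hypothetical helper: computes gapless start times for decoded audio chunks.
// Times are in seconds on the AudioContext clock.
class ChunkScheduler {
  private nextStart = 0;
  private readonly leadTime: number;

  // leadTime is a small safety margin so a chunk is never scheduled in the past.
  constructor(leadTime: number = 0.05) {
    this.leadTime = leadTime;
  }

  // Returns the time at which the chunk should start playing.
  // If playback has fallen behind (underrun), restart slightly after `now`.
  schedule(now: number, duration: number): number {
    const startAt = Math.max(this.nextStart, now + this.leadTime);
    this.nextStart = startAt + duration;
    return startAt;
  }

  // Call on `speaking_end` (or on error) so the next reply starts fresh.
  reset(): void {
    this.nextStart = 0;
  }
}
```

In the hook, each decoded `AudioBuffer` would be played with something like `source.start(scheduler.schedule(ctx.currentTime, buffer.duration))`, where `source` is an `AudioBufferSourceNode` connected to the context destination.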
## 3. live-transcript-ui (MEDIUM — UX for body-double feel)
**Status:** Deltas arrive but are dropped.
**Plan steps:**
1. In `useConversation.ts`: expose live partial transcript state (append deltas, clear on `transcript` or `speaking_start`).
2. Pass the live partial to `ChatBubble.tsx` (or a new "Hearing..." bubble).
3. In ChatBubble: show a temporary "user is speaking: [live text]" or "Kira hears: ..." line with typing indicator while listening.
4. Clear it when final transcript is processed or on error.
5. Style girly-pop (pink bubble or inline).
6. Build/deploy/verify.
**Success criteria:** While holding Talk (or during server VAD), user sees real-time words appearing as she speaks.
**Dependencies:** 1 (deltas only come from working STT).
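The append/clear rules in steps 1 and 4 can be sketched as a small pure reducer. This is a minimal sketch, not the existing hook: the event names `transcript_delta`, `transcript`, and `speaking_start` come from this plan, while the payload shapes, the `error` event, and the names `LiveTranscriptState` and `reduceLiveTranscript` are assumptions for illustration.

```typescript
// Assumed message shapes; only the event names are taken from the plan.
type ConversationEvent =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string }
  | { type: "speaking_start" }
  | { type: "error"; message: string };

interface LiveTranscriptState {
  partial: string;          // text shown in the temporary "Hearing..." bubble
  finalText: string | null; // last completed user transcript
}

const initialLiveTranscript: LiveTranscriptState = { partial: "", finalText: null };

function reduceLiveTranscript(
  state: LiveTranscriptState,
  event: ConversationEvent
): LiveTranscriptState {
  switch (event.type) {
    case "transcript_delta":
      // Append each partial as it arrives.
      return { ...state, partial: state.partial + event.text };
    case "transcript":
      // Final transcript replaces the partial.
      return { partial: "", finalText: event.text };
    case "speaking_start":
    case "error":
      // Clear the live line once Kira answers or something fails.
      return { ...state, partial: "" };
    default:
      return state;
  }
}
```

Keeping the logic pure makes it easy to plug into `useReducer` and to test without a WebSocket.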
## 4. integrate-notes (LOW — quick completeness win)
**Status:** Component exists and is local-state only. Dead import in App.
**Plan steps:**
1. In `App.tsx`: import is already there — add `<Notes />` to the grid (e.g., replace or add alongside Timer or in a new row/column for balance).
2. Consider making notes persistent (localStorage) so they survive refresh (optional quick win).
3. Test layout on 1920x1080.
4. Build/deploy.
**Success criteria:** Notes panel visible and functional in the main dashboard.
**Dependencies:** None.
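For the optional persistence in step 2, a minimal sketch could look like the following. It assumes notes are a plain list of strings. The storage key `kira.notes` and the helpers `loadNotes` / `saveNotes` are hypothetical, and the storage is passed in through a small interface so the same code works with `window.localStorage` or a test double.

```typescript
// Subset of the Web Storage API that the helpers need.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const NOTES_KEY = "kira.notes"; // hypothetical key name

// Returns saved notes, or an empty list if nothing valid is stored.
function loadNotes(store: KeyValueStore): string[] {
  const raw = store.getItem(NOTES_KEY);
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return [];
    return parsed.filter((n): n is string => typeof n === "string");
  } catch {
    // Corrupt JSON should not break the dashboard.
    return [];
  }
}

function saveNotes(store: KeyValueStore, notes: string[]): void {
  store.setItem(NOTES_KEY, JSON.stringify(notes));
}
```

In the component, `loadNotes(window.localStorage)` would seed the initial state and `saveNotes` would run in an effect whenever the notes change.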
## 5. add-white-noise (MEDIUM — matches original spec)
**Status:** Spec: "white noise". Only YouTube lofi exists.
**Plan steps:**
1. Decide sources: free public-domain / CC0 ambient tracks (rain, cafe, fan, brown noise), self-hosted, or noise generated in the browser with the Web Audio API (no external deps, reliable).
2. Extend or new component `WhiteNoisePlayer.tsx` (similar UX to MusicPlayer: buttons for presets, volume, play/pause).
3. Add to grid in `App.tsx` (perhaps replace or share space with MusicPlayer, or add a small ambient section).
4. Integrate with scenes if possible (e.g., rain scene enables rain sound).
5. Use `<audio>` elements with loop for tracks (host small mp3/wav in public/sounds or generate).
6. For zero-asset version: use AudioContext + noise buffer (pink/brown noise generator).
7. Build + deploy.
**Success criteria:** User can toggle white noise independently of (or layered with) lofi. Matches "white noise" in original request.
**Dependencies:** 4 (grid space).
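For the zero-asset version in step 6, the noise samples can be generated once and looped. The sketch below fills a buffer with brown-ish noise using a leaky integrator over white noise, a common approximation rather than a calibrated spectrum. `fillBrownNoise`, the `0.02` / `1.02` smoothing constants, and the `3.5` gain are illustrative choices, not values from the codebase.

```typescript
// Fills `samples` with approximate brown noise in the range [-1, 1].
// `random` is injectable so the function can be tested deterministically.
function fillBrownNoise(
  samples: Float32Array,
  random: () => number = Math.random
): void {
  let last = 0;
  for (let i = 0; i < samples.length; i++) {
    const white = random() * 2 - 1;        // white noise in [-1, 1)
    last = (last + 0.02 * white) / 1.02;   // leaky integration darkens the spectrum
    const boosted = last * 3.5;            // rough gain compensation
    samples[i] = Math.max(-1, Math.min(1, boosted)); // clamp to avoid clipping
  }
}
```

In the player component, the filled array would be copied into an `AudioBuffer` created with `ctx.createBuffer(1, samples.length, ctx.sampleRate)` via `buffer.copyToChannel(samples, 0)`, then played through an `AudioBufferSourceNode` with `loop = true` and a `GainNode` for volume.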
## 6. clean-legacy-dead-code (MEDIUM — hygiene)
**Status:** Old services + docs describe pre-Honcho, pre-nano, pre-Sage, pre-streaming world.
**Plan steps:**
1. Delete (or move to `archive/`) `backend/services/stt.py`, `tts.py`, `llm.py`.
2. Update `backend/requirements.txt` if any unique deps were only for them.
3. Rewrite sections in:
- `README.md` (features, quick start, pipeline, costs).
- `ARCHITECTURE.md` (data flow, stack table, current Realtime attempt).
4. Update `PLAN.md` and `AUDIT.md` references if needed.
5. Remove unused imports and variables in `main.py` (`time`, `pending_transcript`, `lock`).
6. Remove dead `recorderRef` in frontend hook.
7. Commit + deploy (frontend/backend rebuild not strictly needed but do it).
**Success criteria:** No more references to the old Whisper+DeepSeek+Nova pipeline. Codebase matches reality.
**Dependencies:** Can be done in parallel with others.
## 7. fix-deprecations (MEDIUM — polish + warnings)
**Subtasks (can batch):**
a. ScriptProcessorNode → AudioWorkletNode (or, at minimum, isolate the deprecated call in a try/catch and log the warning once).
b. YouTube music: suppress or fix the postMessage/CORS errors, which are common YouTube IFrame issues. Options: host a silent proxy, use a different lofi source such as a free stream, or add `origin` handling + user-gesture guards. For now, add try/catch around YT init and a note in the UI.
c. Timer stopwatch: replace the hacky formula + logic with proper elapsed counter (use a separate `elapsed` state that only increments when mode==='stopwatch').
d. Any other console warnings from build or previous user reports (Pixi interaction deprecation is already noted in history; check current build).
**Plan steps:**
1. Fix one subtask at a time in frontend (Timer.tsx, useConversation.ts, MusicPlayer.tsx).
2. Rebuild frontend after changes.
3. Verify no new errors in `npm run build`.
**Success criteria:** Clean console on load + mic use. Stopwatch actually counts up.
**Dependencies:** None (frontend-only).
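Subtask c can be sketched as a pure tick function plus a formatter. This is a minimal sketch of the "separate `elapsed` state" idea, not the current `Timer.tsx`: the `TimerState` shape and the names `tick` and `formatElapsed` are assumptions, and it assumes one tick per second.

```typescript
type TimerMode = "countdown" | "stopwatch";

interface TimerState {
  mode: TimerMode;
  running: boolean;
  elapsed: number; // whole seconds counted in stopwatch mode
}

// Called once per second; only a running stopwatch advances.
function tick(state: TimerState): TimerState {
  if (state.mode !== "stopwatch" || !state.running) return state;
  return { ...state, elapsed: state.elapsed + 1 };
}

// Formats seconds as MM:SS (minutes keep growing past 59).
function formatElapsed(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(minutes) + ":" + pad(seconds);
}
```

Because `tick` ignores countdown mode, switching modes cannot corrupt the stopwatch value, which is the failure mode the separate state is meant to prevent.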
## 8. fix-welcome-nesting (LOW)
**Status:** When saved ID exists, WelcomeScreen (full min-h-screen) is rendered inside a card div.
**Plan steps:**
1. In `App.tsx`: adjust the conditional rendering for WelcomeScreen — either make it always full-bleed (portal or separate route-like) or simplify the "identify" flow so it doesn't nest.
2. Test both new-user and returning-user paths.
3. Build/deploy.
**Success criteria:** Welcome screen looks correct (full or nicely carded) for both paths. No layout breakage.
**Dependencies:** None.
## 9. audit-followup (FINAL)
**Plan steps:**
1. After all above: full rebuild + deploy to 172.20.1.204.
2. SSH: `docker compose logs --tail 20`, health check, git status on server.
3. Update `AUDIT.md` with "Fixed" sections + remaining known issues.
4. Update this PLAN.md with completion notes.
5. Update todo list to all completed.
6. Provide user with summary + "hard refresh and test mic" instructions.
7. Optional: add a simple test script or note for future.
**Success criteria:** Live site has working (or clearly documented fallback) voice + all spec items visible. No critical items from original audit left open.
**Dependencies:** All previous.
---
## Execution Order & Parallelism
- **Phase 1 (Voice Core — do sequentially):** 1 → 2 → 3
- **Phase 2 (UI Completeness):** 4, then 5 (5 can start once 4 has settled the grid layout)
- **Phase 3 (Cleanup & Polish — parallelizable):** 6, 7, 8
- **Phase 4:** 9 (only after all)
**Estimated effort:** 1-2 hours of focused tool use + deploys (real time longer due to builds/SSH).
**Verification at each step:**
- `cd frontend && npm run build`
- `cd backend && python -c "import main; from services... import ..."`
- `git add -A && git commit -m "..." && git push`
- `ssh root@172.20.1.204 "cd /opt/kira && git pull && docker compose up -d --build 2>&1 | tail -5"`
- `ssh ... "docker logs kira-backend --tail 15 | grep -E 'whisper|STT|error|ready'"`
- Health: `curl http://localhost:8000/api/health` on server
- For frontend changes: note that nginx serves the new dist after rebuild.
**Rollback:** Every commit is atomic. Server has restart policy. Keep a "before" tag if needed.
**Post-plan:** After 9, offer to save this workflow as a skill if it proves reusable.
Start execution now. First action: tackle item 1 (fix-realtime-stt) by correcting the model name.