Kira Body Double — Systematic Fix Plan (from AUDIT.md)
Created: 2026-06-04
Source: Full audit of codebase, running deployment (172.20.1.204), server logs, build outputs, component cross-references, and user-reported symptoms (33s+ delays, static, "Could not transcribe", console errors).
Goal: Turn the current non-functional voice pipeline + incomplete UI into a reliable, low-latency body-double experience matching the original specs (realtime voice, lo-fi + white noise, ADHD tools, avatar customization, notes, clock, pet, backgrounds).
Approach: Prioritize by dependency and impact. Fix core voice first (STT + audio playback + feedback). Then UI completeness. Then cleanup + polish. Deploy + verify after every change where possible. Re-test end-to-end on the live site at the end.
Principles:
- Prefer minimal, targeted patches over big refactors.
- Keep the cheap 3-step pipeline (or Realtime where it actually helps latency).
- Verify with real tool output (builds, docker logs, git, SSH health checks).
- Update docs and todo as we go.
- If a subtask blocks, note it and move to parallelizable items.
Total items: 9 (matching the current todo list). Dependencies noted.
1. fix-realtime-stt (CRITICAL — blocks everything voice-related)
Status in audit: Mic does nothing. The WS connects, then fails with 4004 and model_not_found.
Root cause: Wrong base model in the WS URL (gpt-4o-mini-realtime-preview). gpt-realtime-whisper exists per OpenAI announcements, but it must be passed as the ?model= query parameter when opening the Realtime WS connection for transcription mode. The session config may also need adjusting for pure STT (no LLM responses from the realtime model).
Options:
A. Correct model + config for Realtime WS (preferred for true streaming partials + VAD).
B. Revert to proven REST path (gpt-4o-transcribe) + simulate deltas if possible (cheaper, known working from history).
Plan steps:
- Read current `backend/services/whisper_stream.py` and `backend/main.py` WS handling.
- Update WS URL to `?model=gpt-realtime-whisper` (and adjust session.update if needed: remove or set transcription model correctly).
- Add better error logging + client-facing error on WS failure.
- Rebuild/deploy.
- Verify: SSH docker logs for "Whisper stream ready" + no 4004/model_not_found. Simulate or note user test.
- If still fails after 1-2 attempts: implement Option B (switch main.py back to REST transcription + keep the streaming delta plumbing for future).
Risks: OpenAI access tier for Realtime Whisper. Cost (Realtime is more expensive than REST transcription).
Success criteria: Backend accepts audio chunks and emits `transcript_delta` + `on_done`/`transcript` without errors. No more "Could not transcribe".
Dependencies: None (first).
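
The event forwarding described in the success criteria can be sketched as a small pure mapping. This is a minimal illustration written in TypeScript for readability; the actual service is Python, and the `toClientMessage` helper and the `error` message shape are hypothetical. It assumes the upstream Realtime events end in `input_audio_transcription.delta`, `.completed` and `.failed`, and that the browser expects the `transcript_delta` and `transcript` message names used in this plan.

```typescript
// Messages forwarded to the browser. "transcript_delta" and "transcript" come
// from this plan; the "error" shape is an assumption for illustration.
type ClientMessage =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string }
  | { type: "error"; message: string };

// Loose shape of an upstream Realtime event; only the fields read below.
interface UpstreamEvent {
  type: string;
  delta?: string;
  transcript?: string;
  error?: { code?: string; message?: string };
}

// Hypothetical helper: translate one upstream event, or return null to drop it.
function toClientMessage(event: UpstreamEvent): ClientMessage | null {
  if (event.type.endsWith("input_audio_transcription.delta")) {
    return { type: "transcript_delta", text: event.delta ?? "" };
  }
  if (event.type.endsWith("input_audio_transcription.completed")) {
    return { type: "transcript", text: (event.transcript ?? "").trim() };
  }
  if (event.type === "error" || event.type.endsWith("input_audio_transcription.failed")) {
    // Surface the failure to the client instead of only logging it.
    const code = event.error?.code ?? "unknown";
    const message = event.error?.message ?? "Transcription failed";
    return { type: "error", message: `${code}: ${message}` };
  }
  return null; // e.g. session.created is not forwarded
}
```

Matching on the event-type suffix keeps the mapping tolerant of prefix differences between session types. Whether that tolerance is wanted is a judgment call; exact string matches would be stricter.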
2. streaming-audio-playback (HIGH — fixes perceived latency)
Status: Chunks stream from backend but frontend only plays on speaking_end.
Plan steps:
- Read `frontend/src/hooks/useConversation.ts` audio handling + `backend/main.py` `synthesize_speech`.
- Switch frontend to play chunks incrementally:
  - Option: Use a single Audio element with `srcObject` + MediaSource / SourceBuffer for Opus chunks (complex but true streaming).
  - Simpler reliable: Create new `<audio>` elements or use Web Audio API to decode/play each small chunk as it arrives (low latency for first words).
  - Keep lip-sync tied to `isKiraSpeaking`.
- Update backend to send smaller chunks or keep current streaming.
- Handle `speaking_end` only for final cleanup / state.
- Test build + deploy.
- Verify with logs + manual timing (first audio chunk arrives ~1-2s after LLM).
Success criteria: User hears Kira's voice starting within ~1-2s of her finishing speaking (first words audible while rest is still generating).
Dependencies: 1 (need working STT to test TTS flow).
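
For the Web Audio option above, the main subtlety is scheduling each decoded chunk so that playback is gapless. The sketch below shows only that timing logic as a pure class. `ChunkScheduler` is a hypothetical name, and it assumes each chunk can be decoded on its own, which may not hold for arbitrary slices of an Opus stream.

```typescript
// Computes gapless start times for decoded chunks on an AudioContext-style clock.
// Hypothetical helper; not part of the existing hook.
class ChunkScheduler {
  private nextStart = 0;
  private readonly leadSeconds: number;

  // leadSeconds: small safety margin so a chunk is never scheduled in the past.
  constructor(leadSeconds: number = 0.05) {
    this.leadSeconds = leadSeconds;
  }

  // now: current clock time in seconds (e.g. AudioContext.currentTime).
  // durationSeconds: decoded chunk length (e.g. AudioBuffer.duration).
  // Returns the time at which this chunk should start playing.
  schedule(now: number, durationSeconds: number): number {
    // Queue behind the previous chunk, or restart from "now" after an underrun.
    const startAt = Math.max(this.nextStart, now + this.leadSeconds);
    this.nextStart = startAt + durationSeconds;
    return startAt;
  }

  // Call on speaking_end (or interruption) so the next reply starts fresh.
  reset(): void {
    this.nextStart = 0;
  }
}
```

In the hook, the returned value would be passed to `AudioBufferSourceNode.start(when)` for each chunk as it arrives, with `reset()` called when `speaking_end` is received.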
3. live-transcript-ui (MEDIUM — UX for body-double feel)
Status: Deltas arrive but are dropped.
Plan steps:
- In `useConversation.ts`: expose live partial transcript state (append deltas, clear on `transcript` or `speaking_start`).
- Pass the live partial to `ChatBubble.tsx` (or a new "Hearing..." bubble).
- In ChatBubble: show a temporary "user is speaking: [live text]" or "Kira hears: ..." line with typing indicator while listening.
- Clear it when final transcript is processed or on error.
- Style girly-pop (pink bubble or inline).
- Build/deploy/verify.
Success criteria: While holding Talk (or during server VAD), user sees real-time words appearing as she speaks.
Dependencies: 1 (deltas only come from working STT).
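
The append-and-clear behaviour in the steps above can be sketched as a reducer. The state shape and the `reduceTranscript` name are assumptions for illustration; the event names follow the ones used in this plan.

```typescript
// Events the hook is assumed to receive over the WebSocket.
type TranscriptEvent =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string }
  | { type: "speaking_start" }
  | { type: "error" };

// Hypothetical state shape for the live transcript UI.
interface TranscriptState {
  livePartial: string; // text shown in the temporary "Hearing..." bubble
  finals: string[];    // completed user utterances
}

function reduceTranscript(state: TranscriptState, event: TranscriptEvent): TranscriptState {
  switch (event.type) {
    case "transcript_delta":
      // Append each partial as it arrives.
      return { ...state, livePartial: state.livePartial + event.text };
    case "transcript":
      // The final text replaces the partial and is stored as a finished utterance.
      return { livePartial: "", finals: [...state.finals, event.text] };
    case "speaking_start":
    case "error":
      // Clear the temporary bubble without touching finished utterances.
      return { ...state, livePartial: "" };
  }
}
```

A function of this shape can be passed to React's `useReducer`, so `ChatBubble.tsx` only needs to render `livePartial` when it is non-empty.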
4. integrate-notes (LOW — quick completeness win)
Status: Component exists and is local-state only. Dead import in App.
Plan steps:
- In `App.tsx`: import is already there. Add `<Notes />` to the grid (e.g., replace or add alongside Timer or in a new row/column for balance).
- Consider making notes persistent (localStorage) so they survive refresh (optional quick win).
- Test layout on 1920x1080.
- Build/deploy.
Success criteria: Notes panel visible and functional in the main dashboard.
Dependencies: None.
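
If the optional persistence step is taken, a minimal sketch could look like the following. It assumes notes are stored as a list of strings, which may not match the real component state. The `kira.notes` key and both function names are hypothetical.

```typescript
// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const NOTES_KEY = "kira.notes"; // hypothetical storage key

// Load saved notes, falling back to an empty list on missing or corrupt data.
function loadNotes(store: KeyValueStore): string[] {
  const raw = store.getItem(NOTES_KEY);
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed)
      ? parsed.filter((n): n is string => typeof n === "string")
      : [];
  } catch {
    return []; // corrupted value: start empty rather than crash the panel
  }
}

// Persist the current notes; intended to be called whenever the list changes.
function saveNotes(store: KeyValueStore, notes: string[]): void {
  store.setItem(NOTES_KEY, JSON.stringify(notes));
}
```

Taking the store as a parameter keeps the functions testable without a browser. In the component, `window.localStorage` would be passed in.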
5. add-white-noise (MEDIUM — matches original spec)
Status: The spec calls for "white noise", but only YouTube lofi exists.
Plan steps:
- Decide sources: free public-domain / CC0 ambient tracks (rain, cafe, fan, brown noise) hosted or use Web Audio oscillator + filters for simple generated noise (no external deps, reliable).
- Extend an existing component or add a new one, `WhiteNoisePlayer.tsx` (similar UX to MusicPlayer: buttons for presets, volume, play/pause).
- Add to grid in `App.tsx` (perhaps replace or share space with MusicPlayer, or add a small ambient section).
- Integrate with scenes if possible (e.g., rain scene enables rain sound).
- Use `<audio>` elements with loop for tracks (host small mp3/wav in public/sounds or generate).
- For zero-asset version: use AudioContext + noise buffer (pink/brown noise generator).
- Build + deploy.
Success criteria: User can toggle white noise independently of (or layered with) lofi. Matches "white noise" in original request.
Dependencies: 4 (grid space).
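
For the zero-asset version, the noise buffer can be filled with a few lines of arithmetic. The sketch below generates white noise and an approximate brown noise using a leaky integrator, a commonly used approximation rather than an exact spectral model. The function names and the 0.02 / 3.5 constants are illustrative choices.

```typescript
// Fill a buffer with white noise: independent samples in [-1, 1).
// `random` defaults to Math.random and is injectable for deterministic tests.
function fillWhiteNoise(out: Float32Array, random: () => number = Math.random): void {
  for (let i = 0; i < out.length; i++) {
    out[i] = random() * 2 - 1;
  }
}

// Fill a buffer with approximate brown noise by low-pass filtering white noise
// with a leaky integrator, then applying gain and clamping to [-1, 1].
function fillBrownNoise(out: Float32Array, random: () => number = Math.random): void {
  let last = 0;
  for (let i = 0; i < out.length; i++) {
    const white = random() * 2 - 1;
    last = (last + 0.02 * white) / 1.02; // leak keeps the signal from drifting
    out[i] = Math.max(-1, Math.min(1, last * 3.5)); // make-up gain, then clamp
  }
}
```

In the browser, the target array would typically come from `AudioBuffer.getChannelData(0)`, and the buffer would be played through an `AudioBufferSourceNode` with `loop = true`. A buffer of a few seconds is usually long enough that the loop point is not noticeable.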
6. clean-legacy-dead-code (MEDIUM — hygiene)
Status: Old services + docs describe pre-Honcho, pre-nano, pre-Sage, pre-streaming world.
Plan steps:
- Delete (or move to `archive/`) `backend/services/stt.py`, `tts.py`, `llm.py`.
- Update `backend/requirements.txt` if any unique deps were only for them.
- Rewrite sections in:
  - `README.md` (features, quick start, pipeline, costs).
  - `ARCHITECTURE.md` (data flow, stack table, current Realtime attempt).
- Update `PLAN.md` and `AUDIT.md` references if needed.
- Remove unused imports in `main.py` (time, pending_transcript, lock).
- Remove dead `recorderRef` in frontend hook.
- Commit + deploy (frontend/backend rebuild not strictly needed but do it).
Success criteria: No more references to the old Whisper+DeepSeek+Nova pipeline. Codebase matches reality.
Dependencies: Can be done in parallel with others.
7. fix-deprecations (MEDIUM — polish + warnings)
Subtasks (can batch):
a. ScriptProcessorNode → AudioWorkletNode (or at minimum wrap the deprecation in a try and log once).
b. YouTube music: suppress or fix postMessage/CORS (common YouTube IFrame issues; options: host a silent proxy, use a different lofi source like a free stream, or add origin handling + user gesture guards. For now, add try/catch around YT init and a note in UI).
c. Timer stopwatch: replace the hacky formula + logic with proper elapsed counter (use a separate elapsed state that only increments when mode==='stopwatch').
d. Any other console warnings from build or previous user reports (Pixi interaction deprecation is already noted in history; check current build).
Plan steps:
- Fix one subtask at a time in frontend (Timer.tsx, useConversation.ts, MusicPlayer.tsx).
- Rebuild frontend after changes.
- Verify no new errors in `npm run build`.
Success criteria: Clean console on load + mic use. Stopwatch actually counts up.
Dependencies: None (frontend-only).
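
For subtask c, one way to get a stopwatch that reliably counts up is to derive elapsed time from timestamps instead of incrementing a counter on every tick, since interval callbacks can be throttled in background tabs. This is a sketch of that alternative, not the current `Timer.tsx` logic. All names are hypothetical, and the component would call these only while `mode === 'stopwatch'`.

```typescript
// Hypothetical stopwatch state, independent of the countdown timer's state.
interface StopwatchState {
  accumulatedMs: number;      // time banked from earlier runs
  startedAtMs: number | null; // timestamp of the current run, or null if paused
}

const initialStopwatch: StopwatchState = { accumulatedMs: 0, startedAtMs: null };

// Start (or resume) at time nowMs; a no-op if already running.
function startStopwatch(s: StopwatchState, nowMs: number): StopwatchState {
  return s.startedAtMs !== null ? s : { ...s, startedAtMs: nowMs };
}

// Pause at time nowMs, banking the time from the current run.
function pauseStopwatch(s: StopwatchState, nowMs: number): StopwatchState {
  if (s.startedAtMs === null) return s;
  return { accumulatedMs: s.accumulatedMs + (nowMs - s.startedAtMs), startedAtMs: null };
}

// Total elapsed time as of nowMs, whether running or paused.
function elapsedMs(s: StopwatchState, nowMs: number): number {
  return s.accumulatedMs + (s.startedAtMs === null ? 0 : nowMs - s.startedAtMs);
}

// Format milliseconds as mm:ss for display.
function formatElapsed(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const mm = String(Math.floor(totalSeconds / 60)).padStart(2, "0");
  const ss = String(totalSeconds % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}
```

The render loop would call `elapsedMs(state, Date.now())` on each tick, so a late or skipped tick only delays the display update and never loses time.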
8. fix-welcome-nesting (LOW)
Status: When saved ID exists, WelcomeScreen (full min-h-screen) is rendered inside a card div.
Plan steps:
- In `App.tsx`: adjust the conditional rendering for WelcomeScreen: either make it always full-bleed (portal or separate route-like) or simplify the "identify" flow so it doesn't nest.
- Test both new-user and returning-user paths.
- Build/deploy.
Success criteria: Welcome screen looks correct (full or nicely carded) for both paths. No layout breakage.
Dependencies: None.
9. audit-followup (FINAL)
Plan steps:
- After all above: full rebuild + deploy to 172.20.1.204.
- SSH: `docker compose logs --tail 20`, health check, git status on server.
- Update `AUDIT.md` with "Fixed" sections + remaining known issues.
- Update this PLAN.md with completion notes.
- Update todo list to all completed.
- Provide user with summary + "hard refresh and test mic" instructions.
- Optional: add a simple test script or note for future.
Success criteria: Live site has working (or clearly documented fallback) voice + all spec items visible. No critical items from original audit left open.
Dependencies: All previous.
Execution Order & Parallelism
- Phase 1 (Voice Core — do sequentially): 1 → 2 → 3
- Phase 2 (UI Completeness): 4, 5 (can start after 4)
- Phase 3 (Cleanup & Polish — parallelizable): 6, 7, 8
- Phase 4: 9 (only after all)
Estimated effort: 1-2 hours of focused tool use + deploys (real time longer due to builds/SSH).
Verification at each step:
- `cd frontend && npm run build`
- `cd backend && python -c "import main; from services... import ..."`
- `git add -A && git commit -m "..." && git push`
- `ssh root@172.20.1.204 "cd /opt/kira && git pull && docker compose up -d --build 2>&1 | tail -5"`
- `ssh ... "docker logs kira-backend --tail 15 | grep -E 'whisper|STT|error|ready'"`
- Health: `curl http://localhost:8000/api/health` on server
- For frontend changes: note that nginx serves the new dist after rebuild.
Rollback: Every commit is atomic. Server has restart policy. Keep a "before" tag if needed.
Post-plan: After 9, offer to save this workflow as a skill if it proves reusable.
Start execution now. First action: tackle item 1 (fix-realtime-stt) by correcting the model name.