hobokenchicken 191b7ad9b5 fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed

- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1

2026-06-04 15:14:26 -04:00


Kira Body Double — Systematic Fix Plan (from AUDIT.md)

Created: 2026-06-04

Source: Full audit of codebase, running deployment (172.20.1.204), server logs, build outputs, component cross-references, and user-reported symptoms (33s+ delays, static, "Could not transcribe", console errors).

Goal: Turn the current non-functional voice pipeline + incomplete UI into a reliable, low-latency body-double experience matching the original specs (realtime voice, lo-fi + white noise, ADHD tools, avatar customization, notes, clock, pet, backgrounds).

Approach: Prioritize by dependency and impact. Fix core voice first (STT + audio playback + feedback). Then UI completeness. Then cleanup + polish. Deploy + verify after every change where possible. Re-test end-to-end on the live site at the end.

Principles:

Prefer minimal, targeted patches over big refactors.
Keep the cheap 3-step pipeline (or Realtime where it actually helps latency).
Verify with real tool output (builds, docker logs, git, SSH health checks).
Update docs and todo as we go.
If a subtask blocks, note it and move to parallelizable items.

Total items: 9 (matching the current todo list). Dependencies noted.

1. fix-realtime-stt

Status in audit: Mic does nothing. WS connects then 4004s with model_not_found.

Root cause: Wrong base model in WS URL (gpt-4o-mini-realtime-preview). gpt-realtime-whisper exists per OpenAI announcements but must be used as the ?model= in the Realtime WS connect for transcription mode. Session config may also need tweaking for pure STT (no LLM responses from the realtime model).

Options:
A. Correct model + config for Realtime WS (preferred for true streaming partials + VAD).
B. Revert to proven REST path (gpt-4o-transcribe) + simulate deltas if possible (cheaper, known working from history).

Plan steps:

Read current backend/services/whisper_stream.py and backend/main.py WS handling.
Update WS URL to ?model=gpt-realtime-whisper (and adjust session.update if needed — remove or set transcription model correctly).
Add better error logging + client-facing error on WS failure.
Rebuild/deploy.
Verify: SSH docker logs for "Whisper stream ready" + no 4004/model_not_found. Simulate or note user test.
If still fails after 1-2 attempts: implement Option B (switch main.py back to REST transcription + keep the streaming delta plumbing for future).

Risks: OpenAI access tier for Realtime Whisper. Cost (Realtime is more expensive than REST transcription).
Success criteria: Backend accepts audio chunks and emits transcript_delta + on_done / transcript without errors. No more "Could not transcribe".
Dependencies: None (first).
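
A rough sketch of the event routing this item asks for: each upstream transcription event becomes at most one client-facing message. The backend handler itself is Python; TypeScript is used here only to show the mapping logic. The event shape, the `text` and `message` fields, the `.failed` event and the `error` message type are assumptions for illustration, not confirmed against whisper_stream.py. Only transcript_delta and transcript come from this plan.

```typescript
// Hypothetical shapes: field names are assumptions, not the confirmed wire format.
type RealtimeEvent = {
  type: string;
  delta?: string;
  transcript?: string;
  error?: { message?: string };
};

type ClientMessage =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string }
  | { type: "error"; message: string };

// Map one upstream event to at most one client-facing message.
function routeRealtimeEvent(event: RealtimeEvent): ClientMessage | null {
  if (event.type.endsWith("input_audio_transcription.delta")) {
    // Skip empty deltas so the UI never appends nothing.
    return event.delta ? { type: "transcript_delta", text: event.delta } : null;
  }
  if (event.type.endsWith("input_audio_transcription.completed")) {
    return { type: "transcript", text: (event.transcript ?? "").trim() };
  }
  if (event.type === "error" || event.type.endsWith("input_audio_transcription.failed")) {
    // Forward failures so the frontend can show them instead of staying silent.
    return { type: "error", message: event.error?.message ?? "Could not transcribe" };
  }
  return null; // Events that do not matter for pure STT are ignored.
}
```

Matching on the event-type suffix keeps the handler tolerant of a prefix on the event name, at the cost of being less strict than an exact match.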

2. streaming-audio-playback (HIGH — fixes perceived latency)

Status: Chunks stream from backend but frontend only plays on speaking_end. Plan steps:

Read frontend/src/hooks/useConversation.ts audio handling + backend/main.py synthesize_speech.
Switch frontend to play chunks incrementally:
- Option: Use a single Audio element with srcObject + MediaSource / SourceBuffer for Opus chunks (complex but true streaming).
- Simpler reliable: Create new <audio> elements or use Web Audio API to decode/play each small chunk as it arrives (low latency for first words).
- Keep lip-sync tied to isKiraSpeaking.
Update backend to send smaller chunks or keep current streaming.
Handle speaking_end only for final cleanup / state.
Test build + deploy.
Verify with logs + manual timing (first audio chunk arrives ~1-2s after LLM).

Success criteria: User hears Kira's voice starting within ~1-2s of the user finishing speaking (first words audible while rest is still generating).
Dependencies: 1 (need working STT to test TTS flow).
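
For the Web Audio route, the core of incremental playback is deciding when each chunk starts so that chunks play back to back. A minimal sketch of that scheduling logic, assuming each chunk can be decoded on its own and its duration is known (this may not hold for raw Opus fragments). `ChunkScheduler` and `leadTime` are hypothetical names.

```typescript
// Gapless scheduling for streamed audio chunks.
// Assumption: every chunk is already decoded, so its duration (seconds) is known.
class ChunkScheduler {
  private nextStart = 0;
  private readonly leadTime: number;

  // leadTime: safety margin in seconds so a chunk is never scheduled in the past.
  constructor(leadTime = 0.05) {
    this.leadTime = leadTime;
  }

  // Returns the audio-clock time (seconds) at which this chunk should start.
  schedule(now: number, duration: number): number {
    // Queue behind the previous chunk, or start almost immediately if the queue ran dry.
    const start = Math.max(this.nextStart, now + this.leadTime);
    this.nextStart = start + duration;
    return start;
  }

  // Call on speaking_end (or on interruption) to forget the queue position.
  reset(): void {
    this.nextStart = 0;
  }
}
```

In the browser, the returned value would be passed to `AudioBufferSourceNode.start(when)`, with `now` taken from `AudioContext.currentTime` and `duration` from `AudioBuffer.duration`. speaking_end then only needs to call `reset()` and update state.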

3. live-transcript-ui (MEDIUM — UX for body-double feel)

Status: Deltas arrive but are dropped. Plan steps:

In useConversation.ts: expose live partial transcript state (append deltas, clear on transcript or speaking_start).
Pass the live partial to ChatBubble.tsx (or a new "Hearing..." bubble).
In ChatBubble: show a temporary "user is speaking: [live text]" or "Kira hears: ..." line with typing indicator while listening.
Clear it when final transcript is processed or on error.
Style girly-pop (pink bubble or inline).
Build/deploy/verify.

Success criteria: While holding Talk (or during server VAD), user sees real-time words appearing as she speaks.
Dependencies: 1 (deltas only come from working STT).
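
The append-and-clear rule for the live partial can be kept as a small pure function that useConversation.ts calls for every incoming message. A sketch, assuming messages carry their text in a `text` field and that errors arrive as an `error` message (both are assumptions; `nextLivePartial` is a hypothetical name).

```typescript
// Message names transcript_delta, transcript and speaking_start come from the plan;
// the payload field `text` and the `error` type are assumed for illustration.
type ServerMessage =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string }
  | { type: "speaking_start" }
  | { type: "error" };

// Returns the next live partial transcript given the previous one and a message.
function nextLivePartial(prev: string, msg: ServerMessage): string {
  if (msg.type === "transcript_delta") {
    return prev + msg.text; // Append streamed words as they arrive.
  }
  // Final transcript, the start of Kira's reply, or a failure all clear the bubble.
  return "";
}
```

In a React hook this would typically sit inside a functional state update, for example `setLivePartial(prev => nextLivePartial(prev, msg))`, so that rapid deltas never read a stale value.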

4. integrate-notes (LOW — quick completeness win)

Status: Component exists and is local-state only. Dead import in App. Plan steps:

In App.tsx: import is already there — add <Notes /> to the grid (e.g., replace or add alongside Timer or in a new row/column for balance).
Consider making notes persistent (localStorage) so they survive refresh (optional quick win).
Test layout on 1920x1080.
Build/deploy.

Success criteria: Notes panel visible and functional in the main dashboard.
Dependencies: None.
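
If the optional persistence step is taken, the load/save helpers can be as small as the sketch below. It assumes notes are a list of strings, which may not match the Notes component's real state. The storage key and helper names are hypothetical. The store is passed in so the helpers can be tested without a browser.

```typescript
// Minimal subset of the Web Storage API used by the helpers.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const NOTES_KEY = "kira.notes"; // Hypothetical storage key.

// Reads notes, falling back to an empty list on missing or corrupt data.
function loadNotes(store: KeyValueStore): string[] {
  try {
    const raw = store.getItem(NOTES_KEY);
    const parsed: unknown = raw ? JSON.parse(raw) : [];
    if (!Array.isArray(parsed)) return [];
    return parsed.filter((n): n is string => typeof n === "string");
  } catch {
    return []; // Corrupt JSON or blocked storage: start empty.
  }
}

// Writes notes; failures are swallowed so the UI keeps working in memory.
function saveNotes(store: KeyValueStore, notes: string[]): void {
  try {
    store.setItem(NOTES_KEY, JSON.stringify(notes));
  } catch {
    // Quota exceeded or storage disabled: nothing else to do here.
  }
}
```

In the component, `window.localStorage` satisfies `KeyValueStore`, so `loadNotes(window.localStorage)` could seed the initial state and `saveNotes` could run whenever the list changes.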

5. add-white-noise (MEDIUM — matches original spec)

Status: Spec: "white noise". Only YouTube lofi exists. Plan steps:

Decide sources: free public-domain / CC0 ambient tracks (rain, cafe, fan, brown noise) hosted or use Web Audio oscillator + filters for simple generated noise (no external deps, reliable).
Extend or new component WhiteNoisePlayer.tsx (similar UX to MusicPlayer: buttons for presets, volume, play/pause).
Add to grid in App.tsx (perhaps replace or share space with MusicPlayer, or add a small ambient section).
Integrate with scenes if possible (e.g., rain scene enables rain sound).
Use <audio> elements with loop for tracks (host small mp3/wav in public/sounds or generate).
For zero-asset version: use AudioContext + noise buffer (pink/brown noise generator).
Build + deploy.

Success criteria: User can toggle white noise independently of (or layered with) lofi. Matches "white noise" in original request.
Dependencies: 4 (grid space).
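
For the zero-asset version, the noise buffer itself is just an array of samples. A minimal sketch that produces white noise, or brown noise via a leaky integrator. The function name, the 0.02 / 1.02 smoothing constants and the 3.5 gain are illustrative choices rather than anything this plan specifies, and pink noise is not covered.

```typescript
type NoiseColor = "white" | "brown";

// Returns `length` samples in [-1, 1].
// `random` is injectable so the function can be tested deterministically.
function generateNoise(
  length: number,
  color: NoiseColor,
  random: () => number = Math.random,
): Float32Array {
  const out = new Float32Array(length);
  let last = 0;
  for (let i = 0; i < length; i++) {
    const white = random() * 2 - 1; // Uniform sample in [-1, 1).
    if (color === "white") {
      out[i] = white;
    } else {
      // Leaky integrator: sums white noise, which attenuates high frequencies.
      last = (last + 0.02 * white) / 1.02;
      // Compensate for the lower amplitude, then clamp to the valid range.
      out[i] = Math.max(-1, Math.min(1, last * 3.5));
    }
  }
  return out;
}
```

In the browser, a few seconds of samples could be copied into an `AudioBuffer` with `copyToChannel` and played through an `AudioBufferSourceNode` with `loop` set to true, routed through a `GainNode` for the volume control.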

6. clean-legacy-dead-code (MEDIUM — hygiene)

Status: Old services + docs describe pre-Honcho, pre-nano, pre-Sage, pre-streaming world. Plan steps:

Delete (or move to archive/) backend/services/stt.py, tts.py, llm.py.
Update backend/requirements.txt if any unique deps were only for them.
Rewrite sections in:
- README.md (features, quick start, pipeline, costs).
- ARCHITECTURE.md (data flow, stack table, current Realtime attempt).
Update PLAN.md and AUDIT.md references if needed.
Remove unused imports in main.py (time, pending_transcript, lock).
Remove dead recorderRef in frontend hook.
Commit + deploy (frontend/backend rebuild not strictly needed but do it).

Success criteria: No more references to the old Whisper+DeepSeek+Nova pipeline. Codebase matches reality.
Dependencies: Can be done in parallel with others.

7. fix-deprecations (MEDIUM — polish + warnings)

Subtasks (can batch):
a. ScriptProcessorNode → AudioWorkletNode (or at minimum wrap the deprecated call in a try/catch and log once).
b. YouTube music: suppress or fix postMessage/CORS errors (common YouTube IFrame issues). Options: host a silent proxy, use a different lofi source like a free stream, or add origin handling + user gesture guards. For now, add try/catch around YT init and a note in UI.
c. Timer stopwatch: replace the hacky formula + logic with a proper elapsed counter (use a separate elapsed state that only increments when mode==='stopwatch').
d. Any other console warnings from build or previous user reports (Pixi interaction deprecation is already noted in history; check current build).

Plan steps:

Fix one subtask at a time in frontend (Timer.tsx, useConversation.ts, MusicPlayer.tsx).
Rebuild frontend after changes.
Verify no new errors in npm run build.

Success criteria: Clean console on load + mic use. Stopwatch actually counts up.
Dependencies: None (frontend-only).
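
The stopwatch fix in subtask c comes down to a separate elapsed counter that only advances in stopwatch mode. A sketch of that state logic, assuming a once-per-second tick. The `TimerState` shape, the `countdown` mode name, the `running` flag and both function names are hypothetical; only `mode === 'stopwatch'` comes from the plan.

```typescript
type TimerMode = "countdown" | "stopwatch";

interface TimerState {
  mode: TimerMode;
  running: boolean;
  elapsed: number; // Seconds counted up in stopwatch mode.
}

// Called once per second; elapsed only advances for a running stopwatch.
function tick(state: TimerState): TimerState {
  if (state.mode === "stopwatch" && state.running) {
    return { ...state, elapsed: state.elapsed + 1 };
  }
  return state; // Countdown or paused: leave elapsed untouched.
}

// Formats whole seconds as MM:SS (minutes are not capped at 59).
function formatElapsed(seconds: number): string {
  const m = Math.floor(seconds / 60);
  const s = seconds % 60;
  return `${String(m).padStart(2, "0")}:${String(s).padStart(2, "0")}`;
}
```

Keeping `tick` pure means Timer.tsx can call it from a single `setInterval` callback without the countdown and stopwatch paths sharing a formula.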

8. fix-welcome-nesting (LOW)

Status: When saved ID exists, WelcomeScreen (full min-h-screen) is rendered inside a card div. Plan steps:

In App.tsx: adjust the conditional rendering for WelcomeScreen — either make it always full-bleed (portal or separate route-like) or simplify the "identify" flow so it doesn't nest.
Test both new-user and returning-user paths.
Build/deploy.

Success criteria: Welcome screen looks correct (full or nicely carded) for both paths. No layout breakage.
Dependencies: None.

9. audit-followup (FINAL)

Plan steps:

After all above: full rebuild + deploy to 172.20.1.204.
SSH: docker compose logs --tail 20, health check, git status on server.
Update AUDIT.md with "Fixed" sections + remaining known issues.
Update this PLAN.md with completion notes.
Update todo list to all completed.
Provide user with summary + "hard refresh and test mic" instructions.
Optional: add a simple test script or note for future.

Success criteria: Live site has working (or clearly documented fallback) voice + all spec items visible. No critical items from original audit left open.
Dependencies: All previous.

Execution Order & Parallelism

Phase 1 (Voice Core — do sequentially): 1 → 2 → 3
Phase 2 (UI Completeness): 4, 5 (can start after 4)
Phase 3 (Cleanup & Polish — parallelizable): 6, 7, 8
Phase 4: 9 (only after all)

Estimated effort: 1-2 hours of focused tool use + deploys (real time longer due to builds/SSH).

Verification at each step:

cd frontend && npm run build
cd backend && python -c "import main; from services... import ..."
git add -A && git commit -m "..." && git push
ssh root@172.20.1.204 "cd /opt/kira && git pull && docker compose up -d --build 2>&1 | tail -5"
ssh ... "docker logs kira-backend --tail 15 | grep -E 'whisper|STT|error|ready'"
Health: curl http://localhost:8000/api/health on server
For frontend changes: note that nginx serves the new dist after rebuild.

Rollback: Every commit is atomic. Server has restart policy. Keep a "before" tag if needed.

Post-plan: After 9, offer to save this workflow as a skill if it proves reusable.

Start execution now. First action: tackle item 1 (fix-realtime-stt) by correcting the model name.
