fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed

- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
2026-06-04 15:14:26 -04:00
parent 7502f201c7
commit 191b7ad9b5
4 changed files with 367 additions and 12 deletions
@@ -0,0 +1,180 @@
+# Kira Project Audit — Broken / Wrong / Doesn't Work
+
+**Date:** 2026-06-04 (current session state)
+**Scope:** Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
+**Method:** File inspection, server logs, build verification, component cross-refs, console history from user reports.
+
+---
+
+## Critical (Core Voice Pipeline Broken)
+
+1. **Realtime STT via gpt-realtime-whisper is non-functional**
+   - `backend/services/whisper_stream.py:44`: WS URL hardcodes `?model=gpt-4o-mini-realtime-preview`
+   - Server logs repeatedly: `invalid_request_error.model_not_found` + close code 4004
+   - ``Whisper error: The model `gpt-4o-mini-realtime-preview` does not exist or you do not have access to it.``
+   - Result: Pressing "Talk to Kira" streams PCM16, but the backend never processes it, so there is no transcript and no response. Equivalent to the old "Could not transcribe".
+   - Root cause: OpenAI Realtime models require specific names (e.g. `gpt-4o-realtime-preview`). Either the account lacks preview access, or the "gpt-realtime-whisper" transcription param is not the public way to do pure STT.
+   - The `session.update` payload sets `"model": "gpt-realtime-whisper"` for `input_audio_transcription`, which may also be invalid.
+
+2. **TTS audio playback is batched, not streaming to user**
+   - Backend streams chunks over WS (`with_streaming_response`).
+   - Frontend (`useConversation.ts:119-150`): pushes to `audioBufferRef`, **only combines + plays the Blob on `speaking_end`**.
+   - User hears **nothing** until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
+   - Previous "33s delay" complaints were caused partly by this, plus Honcho latency and full-blob STT.
+
+3. **No live transcript UI (transcript_delta ignored)**
+   - Backend sends `transcript_delta` messages word by word as the user speaks (VAD on server).
+   - Frontend `handleMessage`: `case 'transcript_delta': // Streaming partial... break;`
+   - ChatBubble only shows final `transcript` messages. User has zero feedback on what the mic is hearing in real time.
+
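The event handling described in item 1 and in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `whisper_stream.py`: the function names `build_ws_url` and `classify_event` are hypothetical, and the event type suffixes and the `delta`, `transcript`, and `error.message` fields are assumptions about the Realtime payload shape that should be checked against the API reference.

```python
import json
from typing import Tuple
from urllib.parse import urlencode

# Assumption: the standard Realtime WebSocket endpoint.
REALTIME_BASE = "wss://api.openai.com/v1/realtime"


def build_ws_url(model: str) -> str:
    """Return the Realtime WS URL with the model passed as a query parameter."""
    return f"{REALTIME_BASE}?{urlencode({'model': model})}"


def classify_event(raw: str) -> Tuple[str, str]:
    """Map a raw server event to a (kind, text) pair for the frontend.

    kind is one of "transcript_delta", "transcript", "error", or "ignored".
    """
    event = json.loads(raw)
    etype = event.get("type", "")
    if etype.endswith("input_audio_transcription.delta"):
        return "transcript_delta", event.get("delta", "")
    if etype.endswith("input_audio_transcription.completed"):
        return "transcript", event.get("transcript", "")
    if etype == "error" or etype.endswith("input_audio_transcription.failed"):
        # Assumed payload shape: {"error": {"message": "..."}}
        message = event.get("error", {}).get("message", "unknown error")
        return "error", message
    return "ignored", ""
```

Returning an explicit `"error"` kind instead of only logging gives the frontend something to display, which also bears on the error visibility problem in item 16.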
+---
+
+## Missing / Incomplete Features (from original spec)
+
+4. **Notes widget not in the UI**
+   - `Notes.tsx` fully implemented (add/toggle/delete local notes).
+   - `App.tsx:6` imports it but **never renders** it in the grid.
+   - BackgroundScene also imported but unused (Toolbar is a limited replacement).
+
+5. **No white noise / ambient sounds**
+   - Spec explicitly called for "white noise".
+   - Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.
+
+6. **"Focus tools" beyond basic timer are thin**
+   - Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
+   - No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.
+
+7. **Pet is purely decorative**
+   - `PetZone.tsx`: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.
+
+8. **Accessory system incomplete for Live2D**
+   - Wardrobe sends accessory changes.
+   - `KiraAvatar.tsx:196`: `// The model has hair clips... For now, the accessory system works on the fallback avatar`
+   - Only AnimatedAvatar (SVG) respects accessories.
+
+---
+
+## Dead Code / Stale Architecture
+
+9. **Legacy service files completely unused**
+   - `backend/services/stt.py`, `tts.py`, `llm.py` exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
+   - Current `main.py` has everything inlined + uses `whisper_stream.py`.
+   - Confusing for future devs; requirements.txt still pulls old deps indirectly.
+   - Docs still describe the old stack.
+
+10. **Documentation is badly out of date**
+    - `README.md`: "Whisper STT → DeepSeek → OpenAI Nova"
+    - `ARCHITECTURE.md`: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
+    - `PROJECT_SCOPE.md`: planning questions, not current reality.
+    - No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.
+
+11. **Stale variables and imports**
+    - `backend/main.py`: `import time` (unused), `pending_transcript`, `transcript_lock` (declared, assigned in on_done but never read — leftover from aborted refactor).
+    - `frontend/src/hooks/useConversation.ts`: `recorderRef` declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
+    - `App.tsx`: unused imports for `BackgroundScene` and `Notes`.
+
+---
+
+## UX / Reliability / Console Problems
+
+12. **Deprecated audio capture**
+    - `startPCMCapture` uses `ScriptProcessorNode` (deprecated since 2019, console warning on every mic use).
+    - Should be AudioWorkletNode for production.
+
+13. **YouTube music player is flaky and spammy**
+    - Repeated console errors:
+      - `Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...`
+      - `Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS`
+      - `net::ERR_FAILED 503`
+    - Music often requires user gesture; ads/CSP can break playback or fill console.
+
+14. **Timer stopwatch mode is broken**
+    - `Timer.tsx:115`: insane display formula `formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))`
+    - Interval always does `r - 1`; starting at 0 for stopwatch immediately hits <=1 and stops.
+    - No real count-up logic.
+
+15. **WelcomeScreen rendering bug**
+    - When saved user-id exists but not yet identified: renders a full `min-h-screen` WelcomeScreen **inside** a small glass-card. CSS collision, bad layout.
+
+16. **Poor error visibility for mic/STT failures**
+    - Mic errors set the `micError` state, but it appears to be surfaced in only one place (mostly the console).
+    - STT connection failures produce no user-facing message — just silent "listening" that does nothing.
+    - `handleMessage` for 'error' only does `console.error`.
+
+17. **Live2D vs fallback inconsistency**
+    - Attempts to load `kira.model3.json` (which proxies Epsilon model + custom textures).
+    - Outfit swap code runs but is fragile (`(model as any).textures[2] = ...`).
+    - Many users will see the SVG fallback (or broken Live2D).
+    - Cubism deprecation warnings on load.
+
+18. **No real interruption / barge-in**
+    - User can keep talking while Kira speaks; no client "stop" signal.
+    - No handling of `interruption` type on backend.
+
+19. **Continuous streaming while "listening"**
+    - Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
+    - Server VAD fires `on_done` on silences → Kira can interject multiple times unprompted.
+
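For item 18, one possible backend shape for barge-in is to keep a handle on the in-flight TTS task and cancel it when the client sends an `interruption` message. The sketch below is an assumption-laden illustration: the `Speaker` class and its methods are hypothetical, and the `sent` list stands in for writing audio chunks to the WebSocket.

```python
import asyncio
from typing import List, Optional, Tuple


class Speaker:
    """Tracks the in-flight TTS task so an interruption can cancel it."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self.sent: List[bytes] = []  # stands in for sending bytes over the WS

    async def _stream(self, chunks: List[bytes]) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # simulated wait for the next TTS chunk
            self.sent.append(chunk)

    def speak(self, chunks: List[bytes]) -> asyncio.Task:
        self.interrupt()  # never let two replies overlap
        self._task = asyncio.create_task(self._stream(chunks))
        return self._task

    def interrupt(self) -> bool:
        """Cancel the current reply; return True if something was cancelled."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            return True
        return False


async def demo() -> Tuple[bool, int]:
    speaker = Speaker()
    task = speaker.speak([b"a", b"b", b"c", b"d", b"e"])
    await asyncio.sleep(0.025)       # user starts talking mid-reply
    cancelled = speaker.interrupt()  # what an "interruption" message would trigger
    await asyncio.gather(task, return_exceptions=True)
    return cancelled, len(speaker.sent)
```

Running `asyncio.run(demo())` cancels the reply before all five chunks are sent. The client would still need to stop local playback and flush its audio buffer when it sends the signal.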
+---
+
+## Performance / Cost / Infra
+
+20. **LLM → TTS not overlapped**
+    - In `on_done` and text path: full `await chat.completions` before starting TTS stream.
+    - Even with fast nano, adds 1-2s serial delay before first audio chunk.
+
+21. **Honcho calls are expensive/slow**
+    - `build_system_prompt_suffix` (called once at identify) makes two `peer.chat(...)` calls, each of which is an LLM call inside Honcho.
+    - Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.
+
+22. **Model access / pricing mismatch**
+    - Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
+    - Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.
+
+23. **No health / readiness for the whisper sub-connection**
+    - Main `/api/health` says "ok" even when whisper stream is dead.
+
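The serial delay in item 20 could be reduced by streaming the LLM output and handing each finished sentence to TTS while generation continues. The following is a sketch under stated assumptions: `fake_llm_stream` is a hypothetical stand-in for a streamed chat completion, appending to `spoken` stands in for starting a TTS stream, and the sentence splitter is deliberately naive (it would mis-split abbreviations such as "e.g.").

```python
import asyncio
from typing import AsyncIterator, List

SENTENCE_END = (".", "!", "?")


async def fake_llm_stream() -> AsyncIterator[str]:
    """Stand-in for a streamed chat completion (hypothetical token source)."""
    for token in ["Nice ", "work. ", "Ready ", "for ", "the ", "next ", "task?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences, emitting each as soon as it ends."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no terminator


async def run_turn() -> List[str]:
    """Feed each sentence to TTS while the LLM keeps generating."""
    queue: asyncio.Queue = asyncio.Queue()
    spoken: List[str] = []

    async def producer() -> None:
        async for sentence in sentences(fake_llm_stream()):
            await queue.put(sentence)
        await queue.put(None)  # sentinel: the LLM has finished

    async def consumer() -> None:
        while (sentence := await queue.get()) is not None:
            spoken.append(sentence)  # stands in for starting a TTS stream

    await asyncio.gather(producer(), consumer())
    return spoken
```

With this shape, the first audio chunk depends on the time to the first complete sentence instead of the time to the full completion.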
+---
+
+## Deployment / Dev Experience
+
+24. **Hardcoded model strings everywhere**
+    - Changing STT/LLM/TTS requires edits in multiple places + no central config.
+
+25. **No .env validation or example for Realtime keys**
+    - .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
+
+26. **Build produces large bundles**
+    - Live2D cubism + pixi ~300k gzipped; no tree-shaking notes.
+
+27. **Local dev vs prod drift**
+    - Vite proxy assumes `backend` hostname (docker).
+    - No easy `npm run dev` + separate backend story documented.
+
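Items 24 and 25 could be addressed together with a single settings module that validates the environment at startup. This is a minimal sketch, not the project's configuration: the `Settings` class, `load_settings`, and the environment variable names are hypothetical, and the defaults simply mirror the model and voice names mentioned in this audit.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    honcho_api_key: str
    stt_model: str
    llm_model: str
    tts_voice: str


# Defaults mirror the names mentioned in this audit; adjust as models change.
DEFAULTS = {
    "STT_MODEL": "gpt-realtime-whisper",
    "LLM_MODEL": "gpt-5.4-nano",
    "TTS_VOICE": "sage",
}
REQUIRED = ("OPENAI_API_KEY", "HONCHO_API_KEY")  # hypothetical variable names


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast on missing keys."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        honcho_api_key=env["HONCHO_API_KEY"],
        stt_model=env.get("STT_MODEL", DEFAULTS["STT_MODEL"]),
        llm_model=env.get("LLM_MODEL", DEFAULTS["LLM_MODEL"]),
        tts_voice=env.get("TTS_VOICE", DEFAULTS["TTS_VOICE"]),
    )
```

Calling `load_settings()` once at startup and importing the result elsewhere would mean a model change touches one file, and a missing key fails at boot instead of on the first voice request.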
+---
+
+## Quick Wins (Prioritized)
+
+1. Fix Realtime model name + test the WS transcription session (or fall back to REST `gpt-4o-transcribe` + streaming deltas via another mechanism).
+2. Make TTS playback incremental (use AudioContext buffer append or successive `<audio>` elements / MediaSource).
+3. Wire `transcript_delta` into ChatBubble as a "Kira is hearing..." live line or user speech preview.
+4. Delete or archive legacy stt/tts/llm.py and update all docs.
+5. Render `<Notes />` in the grid (replace one column or add a 5th).
+6. Add a simple white noise player (HTMLAudio with free CC0 tracks or oscillator).
+7. Replace ScriptProcessor with AudioWorklet (or at least suppress the deprecation for now).
+8. Fix stopwatch timer logic properly (separate elapsed state).
+9. Clean unused imports + recorderRef.
+10. Add basic user-facing error banner for mic/STT failures.
+
+---
+
+## Recommendations
+
+- **Short term**: Revert voice pipeline to the previous working (cheaper) REST version (`gpt-4o-transcribe` + nano + streaming TTS) until the Realtime transcription model name + access is sorted. The "pivot to Realtime WS" introduced more problems than it solved for latency.
+- **Medium**: Implement proper client VAD or use OpenAI's server VAD more cleanly; consider always-listening with push-to-interrupt.
+- **Long**: Either get proper Realtime API access or switch to a different low-latency STT (e.g. browser Web Speech API for free tier, or Deepgram/AssemblyAI).
+- Consider making the avatar always visible + voice always available rather than explicit "Talk" toggle for true body-double feel.
+
+This audit is based on code + live server state as of the last successful deploy. Re-run after fixes.
+
+Next steps: I can implement the quick wins (start with 1, 2, 3, 4) if you say the word. The old preserved todo list is now stale and should be replaced.