From 191b7ad9b55ecc9f109ae57ba315dd994b066e8e Mon Sep 17 00:00:00 2001
From: hobokenchicken
Date: Thu, 4 Jun 2026 15:14:26 -0400
Subject: [PATCH] fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed

- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
---
 AUDIT.md                           | 180 +++++++++++++++++++++++++++++
 PLAN.md                            | 157 +++++++++++++++++++++++++
 backend/main.py                    |   4 +
 backend/services/whisper_stream.py |  38 ++++--
 4 files changed, 367 insertions(+), 12 deletions(-)
 create mode 100644 AUDIT.md
 create mode 100644 PLAN.md

diff --git a/AUDIT.md b/AUDIT.md
new file mode 100644
index 0000000..cfefbf4
--- /dev/null
+++ b/AUDIT.md
@@ -0,0 +1,180 @@
+# Kira Project Audit — Broken / Wrong / Doesn't Work
+
+**Date:** 2026-06-04 (current session state)
+**Scope:** Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
+**Method:** File inspection, server logs, build verification, component cross-refs, console history from user reports.
+
+---
+
+## Critical (Core Voice Pipeline Broken)
+
+1. **Realtime STT via gpt-realtime-whisper is non-functional**
+   - `backend/services/whisper_stream.py:44`: WS URL hardcodes `?model=gpt-4o-mini-realtime-preview`
+   - Server logs repeatedly: `invalid_request_error.model_not_found` + close code 4004
+   - `Whisper error: The model `gpt-4o-mini-realtime-preview` does not exist or you do not have access to it.`
+   - Result: Pressing "Talk to Kira" streams PCM16 but backend never processes → no transcript, no response. Equivalent to the old "Could not transcribe".
+   - Root: OpenAI Realtime models require specific names (e.g. `gpt-4o-realtime-preview`); account may lack preview access or "gpt-realtime-whisper" transcription param is not the public way to do pure STT.
+   - The session.update tries `"model": "gpt-realtime-whisper"` for input_audio_transcription — this may also be invalid.
+
+2. **TTS audio playback is batched, not streaming to user**
+   - Backend streams chunks over WS (`with_streaming_response`).
+   - Frontend (`useConversation.ts:119-150`): pushes to `audioBufferRef`, **only combines + plays the Blob on `speaking_end`**.
+   - User hears **nothing** until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
+   - Previous "33s delay" complaints were partly this + Honcho + full STT blob.
+
+3. **No live transcript UI (transcript_delta ignored)**
+   - Backend sends `transcript_delta` for word-by-word as she speaks (VAD on server).
+   - Frontend `handleMessage`: `case 'transcript_delta': // Streaming partial... break;`
+   - ChatBubble only shows final `transcript` messages. User has zero feedback on what the mic is hearing in real time.
+
+---
+
+## Missing / Incomplete Features (from original spec)
+
+4. **Notes widget not in the UI**
+   - `Notes.tsx` fully implemented (add/toggle/delete local notes).
+   - `App.tsx:6` imports it but **never renders** it in the grid.
+   - BackgroundScene also imported but unused (Toolbar is a limited replacement).
+
+5. **No white noise / ambient sounds**
+   - Spec explicitly called for "white noise".
+   - Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.
+
+6. **"Focus tools" beyond basic timer are thin**
+   - Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
+   - No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.
+
+7. **Pet is purely decorative**
+   - `PetZone.tsx`: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.
+
+8. **Accessory system incomplete for Live2D**
+   - Wardrobe sends accessory changes.
+   - `KiraAvatar.tsx:196`: `// The model has hair clips... For now, the accessory system works on the fallback avatar`
+   - Only AnimatedAvatar (SVG) respects accessories.
+
+---
+
+## Dead Code / Stale Architecture
+
+9. **Legacy service files completely unused**
+   - `backend/services/stt.py`, `tts.py`, `llm.py` exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
+   - Current `main.py` has everything inlined + uses `whisper_stream.py`.
+   - Confusing for future devs; requirements.txt still pulls old deps indirectly.
+   - Docs still describe the old stack.
+
+10. **Documentation is badly out of date**
+   - `README.md`: "Whisper STT → DeepSeek → OpenAI Nova"
+   - `ARCHITECTURE.md`: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
+   - `PROJECT_SCOPE.md`: planning questions, not current reality.
+   - No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.
+
+11. **Stale variables and imports**
+   - `backend/main.py`: `import time` (unused), `pending_transcript`, `transcript_lock` (declared, assigned in on_done but never read — leftover from aborted refactor).
+   - `frontend/src/hooks/useConversation.ts`: `recorderRef` declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
+   - `App.tsx`: unused imports for `BackgroundScene` and `Notes`.
+
+---
+
+## UX / Reliability / Console Problems
+
+12. **Deprecated audio capture**
+   - `startPCMCapture` uses `ScriptProcessorNode` (deprecated since 2019, console warning on every mic use).
+   - Should be AudioWorkletNode for production.
+
+13. **YouTube music player is flaky and spammy**
+   - Repeated console errors:
+     - `Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...`
+     - `Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS`
+     - `net::ERR_FAILED 503`
+   - Music often requires user gesture; ads/CSP can break playback or fill console.
+
+14. **Timer stopwatch mode is broken**
+   - `Timer.tsx:115`: insane display formula `formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))`
+   - Interval always does `r - 1`; starting at 0 for stopwatch immediately hits <=1 and stops.
+   - No real count-up logic.
+
+15. **WelcomeScreen rendering bug**
+   - When saved user-id exists but not yet identified: renders a full `min-h-screen` WelcomeScreen **inside** a small glass-card. CSS collision, bad layout.
+
+16. **Poor error visibility for mic/STT failures**
+   - Mic errors set `micError` state but it's only used in one place? (mostly console).
+   - STT connection failures produce no user-facing message — just silent "listening" that does nothing.
+   - `handleMessage` for 'error' only does `console.error`.
+
+17. **Live2D vs fallback inconsistency**
+   - Attempts to load `kira.model3.json` (which proxies Epsilon model + custom textures).
+   - Outfit swap code runs but is fragile (`(model as any).textures[2] = ...`).
+   - Many users will see the SVG fallback (or broken Live2D).
+   - Cubism deprecation warnings on load.
+
+18. **No real interruption / barge-in**
+   - User can keep talking while Kira speaks; no client "stop" signal.
+   - No handling of `interruption` type on backend.
+
+19. **Continuous streaming while "listening"**
+   - Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
+   - Server VAD fires `on_done` on silences → Kira can interject multiple times unprompted.
+
+---
+
+## Performance / Cost / Infra
+
+20. **LLM → TTS not overlapped**
+   - In `on_done` and text path: full `await chat.completions` before starting TTS stream.
+   - Even with fast nano, adds 1-2s serial delay before first audio chunk.
+
+21. **Honcho calls are expensive/slow**
+   - `build_system_prompt_suffix` (called once at identify) does two `peer.chat(...)` which are LLM calls inside Honcho.
+   - Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.
+
+22. **Model access / pricing mismatch**
+   - Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
+   - Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.
+
+23. **No health / readiness for the whisper sub-connection**
+   - Main `/api/health` says "ok" even when whisper stream is dead.
+
+---
+
+## Deployment / Dev Experience
+
+24. **Hardcoded model strings everywhere**
+   - Changing STT/LLM/TTS requires edits in multiple places + no central config.
+
+25. **No .env validation or example for Realtime keys**
+   - .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
+
+26. **Build produces large bundles**
+   - Live2D cubism + pixi ~300k gzipped; no tree-shaking notes.
+
+27. **Local dev vs prod drift**
+   - Vite proxy assumes `backend` hostname (docker).
+   - No easy `npm run dev` + separate backend story documented.
+
+---
+
+## Quick Wins (Prioritized)
+
+1. Fix Realtime model name + test the WS transcription session (or fall back to REST `gpt-4o-transcribe` + streaming deltas via another mechanism).
+2. Make TTS playback incremental (use AudioContext buffer append or successive