- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
Kira Project Audit — Broken / Wrong / Doesn't Work
Date: 2026-06-04 (current session state)
Scope: Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
Method: File inspection, server logs, build verification, component cross-refs, console history from user reports.
Critical (Core Voice Pipeline Broken)
- Realtime STT via gpt-realtime-whisper is non-functional
  - `backend/services/whisper_stream.py:44`: WS URL hardcodes `?model=gpt-4o-mini-realtime-preview`.
  - Server logs repeatedly: `invalid_request_error.model_not_found` + close code 4004, "Whisper error: The model `gpt-4o-mini-realtime-preview` does not exist or you do not have access to it."
  - Result: Pressing "Talk to Kira" streams PCM16 but backend never processes → no transcript, no response. Equivalent to the old "Could not transcribe".
  - Root: OpenAI Realtime models require specific names (e.g. `gpt-4o-realtime-preview`); account may lack preview access or "gpt-realtime-whisper" transcription param is not the public way to do pure STT.
  - The session.update tries `"model": "gpt-realtime-whisper"` for input_audio_transcription; this may also be invalid.
- TTS audio playback is batched, not streaming to user
  - Backend streams chunks over WS (`with_streaming_response`).
  - Frontend (`useConversation.ts:119-150`): pushes to `audioBufferRef`, only combines + plays the Blob on `speaking_end`.
  - User hears nothing until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
  - Previous "33s delay" complaints were partly this + Honcho + full STT blob.
- No live transcript UI (transcript_delta ignored)
  - Backend sends `transcript_delta` word by word as she speaks (VAD on server).
  - Frontend `handleMessage`: `case 'transcript_delta': // Streaming partial... break;`
  - ChatBubble only shows final `transcript` messages. User has zero feedback on what the mic is hearing in real time.
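As a minimal sketch of wiring `transcript_delta` into the UI, the reducer below accumulates partial text into a live preview and clears it once the final `transcript` message arrives. The message shapes (a `text` field on both types) and the names `SttMessage`, `TranscriptView`, and `applySttMessage` are assumptions for illustration, not taken from `useConversation.ts`. It also assumes each delta is an incremental fragment rather than the cumulative text so far.

```typescript
// Hypothetical message shapes; the real payload fields may differ.
type SttMessage =
  | { type: "transcript_delta"; text: string }
  | { type: "transcript"; text: string };

interface TranscriptView {
  finalLines: string[]; // completed utterances, rendered as chat bubbles
  livePreview: string;  // partial text for the utterance still in progress
}

// Pure reducer: returns the next view state for one incoming STT message.
function applySttMessage(view: TranscriptView, msg: SttMessage): TranscriptView {
  switch (msg.type) {
    case "transcript_delta":
      // Append the fragment instead of dropping it, so the user sees
      // what the mic is picking up while they speak.
      return { finalLines: view.finalLines, livePreview: view.livePreview + msg.text };
    case "transcript":
      // The final transcript supersedes the preview and becomes a bubble.
      return { finalLines: [...view.finalLines, msg.text], livePreview: "" };
  }
}
```

In a React hook this could back a `useReducer` call, with ChatBubble rendering `livePreview` as a dimmed line below the final messages.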
Missing / Incomplete Features (from original spec)
- Notes widget not in the UI
  - `Notes.tsx` fully implemented (add/toggle/delete local notes).
  - `App.tsx:6` imports it but never renders it in the grid.
  - BackgroundScene also imported but unused (Toolbar is a limited replacement).
- No white noise / ambient sounds
  - Spec explicitly called for "white noise".
  - Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.
- "Focus tools" beyond basic timer are thin
  - Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
  - No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.
- Pet is purely decorative
  - `PetZone.tsx`: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.
- Accessory system incomplete for Live2D
  - Wardrobe sends accessory changes.
  - `KiraAvatar.tsx:196`: `// The model has hair clips... For now, the accessory system works on the fallback avatar`
  - Only AnimatedAvatar (SVG) respects accessories.
Dead Code / Stale Architecture
- Legacy service files completely unused
  - `backend/services/stt.py`, `tts.py`, `llm.py` exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
  - Current `main.py` has everything inlined + uses `whisper_stream.py`.
  - Confusing for future devs; requirements.txt still pulls old deps indirectly.
  - Docs still describe the old stack.
- Documentation is badly out of date
  - `README.md`: "Whisper STT → DeepSeek → OpenAI Nova"
  - `ARCHITECTURE.md`: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
  - `PROJECT_SCOPE.md`: planning questions, not current reality.
  - No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.
- Stale variables and imports
  - `backend/main.py`: `import time` (unused), `pending_transcript`, `transcript_lock` (declared, assigned in on_done but never read; leftover from aborted refactor).
  - `frontend/src/hooks/useConversation.ts`: `recorderRef` declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
  - `App.tsx`: unused imports for `BackgroundScene` and `Notes`.
UX / Reliability / Console Problems
- Deprecated audio capture
  - `startPCMCapture` uses `ScriptProcessorNode` (deprecated since 2019, console warning on every mic use).
  - Should be AudioWorkletNode for production.
- YouTube music player is flaky and spammy
  - Repeated console errors:
    - `Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...`
    - `Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS`
    - `net::ERR_FAILED 503`
  - Music often requires user gesture; ads/CSP can break playback or fill console.
- Timer stopwatch mode is broken
  - `Timer.tsx:115`: insane display formula `formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))`
  - Interval always does `r - 1`; starting at 0 for stopwatch immediately hits <=1 and stops.
  - No real count-up logic.
- WelcomeScreen rendering bug
  - When saved user-id exists but not yet identified: renders a full `min-h-screen` WelcomeScreen inside a small glass-card. CSS collision, bad layout.
- Poor error visibility for mic/STT failures
  - Mic errors set `micError` state, but it appears to be used in only one place (errors mostly go to the console).
  - STT connection failures produce no user-facing message, just a silent "listening" state that does nothing.
  - `handleMessage` for 'error' only does `console.error`.
- Live2D vs fallback inconsistency
  - Attempts to load `kira.model3.json` (which proxies Epsilon model + custom textures).
  - Outfit swap code runs but is fragile (`(model as any).textures[2] = ...`).
  - Many users will see the SVG fallback (or broken Live2D).
  - Cubism deprecation warnings on load.
- No real interruption / barge-in
  - User can keep talking while Kira speaks; no client "stop" signal.
  - No handling of `interruption` type on backend.
- Continuous streaming while "listening"
  - Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
  - Server VAD fires `on_done` on silences → Kira can interject multiple times unprompted.
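The stopwatch bug in `Timer.tsx` noted above could be addressed by tracking elapsed time separately from the countdown's remaining time, so that a stopwatch starting at 0 never hits the countdown's stop condition. The sketch below shows that separation as pure functions. `TimerState`, `tickTimer`, `displaySeconds`, and `formatClock` are hypothetical names, and the real component's state shape may differ.

```typescript
type TimerMode = "countdown" | "stopwatch";

interface TimerState {
  mode: TimerMode;
  remaining: number; // seconds left; only meaningful in countdown mode
  elapsed: number;   // seconds counted up; only meaningful in stopwatch mode
  running: boolean;
}

// Advance the timer by one second. Intended to be called from a 1s interval.
function tickTimer(s: TimerState): TimerState {
  if (!s.running) return s;
  if (s.mode === "stopwatch") {
    // Count up; there is no stop condition tied to reaching zero.
    return { mode: s.mode, remaining: s.remaining, elapsed: s.elapsed + 1, running: true };
  }
  // Count down and stop once zero is reached.
  const remaining = Math.max(0, s.remaining - 1);
  return { mode: s.mode, remaining: remaining, elapsed: s.elapsed, running: remaining > 0 };
}

// The value to display depends only on the mode, not on a combined formula.
function displaySeconds(s: TimerState): number {
  return s.mode === "stopwatch" ? s.elapsed : s.remaining;
}

// Format whole seconds as MM:SS.
function formatClock(totalSeconds: number): string {
  const m = Math.floor(totalSeconds / 60);
  const sec = totalSeconds % 60;
  return (m < 10 ? "0" : "") + m + ":" + (sec < 10 ? "0" : "") + sec;
}
```

Keeping the tick logic pure makes it easy to unit test without timers; the component would only need to call `tickTimer` from its interval and render `formatClock(displaySeconds(state))`.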
Performance / Cost / Infra
- LLM → TTS not overlapped
  - In `on_done` and text path: full `await chat.completions` before starting TTS stream.
  - Even with fast nano, adds 1-2s serial delay before first audio chunk.
- Honcho calls are expensive/slow
  - `build_system_prompt_suffix` (called once at identify) does two `peer.chat(...)` which are LLM calls inside Honcho.
  - Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.
- Model access / pricing mismatch
  - Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
  - Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.
- No health / readiness for the whisper sub-connection
  - Main `/api/health` says "ok" even when whisper stream is dead.
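One common way to overlap the LLM and TTS stages flagged above is to stream the LLM output and hand each completed sentence to TTS as soon as it appears, instead of waiting for the full completion. The sketch below shows only the sentence-splitting step. It assumes the LLM call is switched to a streaming mode that yields text deltas, which the audit does not confirm; `SentenceChunker` is a hypothetical helper, and since the backend is Python this logic would need porting. The boundary rule is deliberately naive and will split on abbreviations such as "e.g. ".

```typescript
// Splits a stream of LLM text deltas into sentences so that each sentence
// can be sent to TTS as soon as it is complete.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed text delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const done: string[] = [];
    // Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      done.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return done;
  }

  // Call when the LLM stream ends to emit any trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

With this shape, TTS for the first sentence can start while the model is still generating the rest, which targets the 1-2s serial delay before the first audio chunk.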
Deployment / Dev Experience
- Hardcoded model strings everywhere
  - Changing STT/LLM/TTS requires edits in multiple places + no central config.
- No .env validation or example for Realtime keys
  - .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
- Build produces large bundles
  - Live2D cubism + pixi ~300k gzipped; no tree-shaking notes.
- Local dev vs prod drift
  - Vite proxy assumes `backend` hostname (docker).
  - No easy `npm run dev` + separate backend story documented.
Quick Wins (Prioritized)
1. Fix Realtime model name + test the WS transcription session (or fall back to REST `gpt-4o-transcribe` + streaming deltas via another mechanism).
2. Make TTS playback incremental (use AudioContext buffer append or successive audio elements / MediaSource).
3. Wire `transcript_delta` into ChatBubble as a "Kira is hearing..." live line or user speech preview.
4. Delete or archive legacy stt/tts/llm.py and update all docs.
5. Render `<Notes />` in the grid (replace one column or add a 5th).
6. Add a simple white noise player (HTMLAudio with free CC0 tracks or oscillator).
7. Replace ScriptProcessor with AudioWorklet (or at least suppress the deprecation for now).
8. Fix stopwatch timer logic properly (separate elapsed state).
9. Clean unused imports + recorderRef.
10. Add basic user-facing error banner for mic/STT failures.
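For the incremental TTS playback item, one possible approach is to start each chunk as it arrives and schedule it to begin exactly where the previous chunk ends. The sketch below covers only the timing bookkeeping. It assumes each chunk can be decoded on arrival into audio of known duration (for example raw PCM), which the audit does not state; `ChunkScheduler` and its `leadTime` parameter are hypothetical.

```typescript
// Tracks when the next audio chunk should start so chunks play back-to-back.
class ChunkScheduler {
  private nextStart = 0;

  // now:      current audio clock time in seconds (e.g. AudioContext.currentTime)
  // duration: length of the decoded chunk in seconds
  // leadTime: safety margin used for the first chunk or after an underrun
  // Returns the clock time at which this chunk should start playing.
  schedule(now: number, duration: number, leadTime: number = 0.05): number {
    // Either continue right after the previous chunk, or, if playback has
    // fallen behind, restart slightly in the future.
    const startAt = Math.max(this.nextStart, now + leadTime);
    this.nextStart = startAt + duration;
    return startAt;
  }

  // Forget the queue position, e.g. on speaking_end or when playback is stopped.
  reset(): void {
    this.nextStart = 0;
  }
}
```

In the browser, the returned time would typically be passed to `AudioBufferSourceNode.start(when)` with `duration` taken from the decoded `AudioBuffer`. If the chunks are in a compressed format that cannot be decoded piecewise, a MediaSource-based path would be needed instead.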
Recommendations
- Short term: Revert voice pipeline to the previous working (cheaper) REST version (`gpt-4o-transcribe` + nano + streaming TTS) until the Realtime transcription model name + access is sorted. The "pivot to Realtime WS" introduced more problems than it solved for latency.
- Medium: Implement proper client VAD or use OpenAI's server VAD more cleanly; consider always-listening with push-to-interrupt.
- Long: Either get proper Realtime API access or switch to a different low-latency STT (e.g. browser Web Speech API for free tier, or Deepgram/AssemblyAI).
- Consider making the avatar always visible + voice always available rather than explicit "Talk" toggle for true body-double feel.
This audit is based on code + live server state as of the last successful deploy. Re-run after fixes.
Next steps: I can implement the quick wins (start with 1, 2, 3, 4) if you say the word. The old preserved todo list is now stale and should be replaced.