hobokenchicken 191b7ad9b5 fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed

- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1

2026-06-04 15:14:26 -04:00


Kira Project Audit — Broken / Wrong / Doesn't Work

Date: 2026-06-04 (current session state)
Scope: Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
Method: File inspection, server logs, build verification, component cross-refs, console history from user reports.

Critical (Core Voice Pipeline Broken)

Realtime STT via gpt-realtime-whisper is non-functional
- backend/services/whisper_stream.py:44: WS URL hardcodes ?model=gpt-4o-mini-realtime-preview
- Server logs repeatedly show invalid_request_error.model_not_found and close code 4004.
- Whisper error: The model gpt-4o-mini-realtime-preview does not exist or you do not have access to it.
- Result: Pressing "Talk to Kira" streams PCM16, but the backend never processes it, so there is no transcript and no response. Equivalent to the old "Could not transcribe".
- Root cause: OpenAI Realtime models require specific names (e.g. gpt-4o-realtime-preview). Either the account lacks preview access, or passing "gpt-realtime-whisper" as the transcription model is not the public way to do pure STT.
- The session.update tries "model": "gpt-realtime-whisper" for input_audio_transcription — this may also be invalid.
TTS audio playback is batched, not streaming to user
- Backend streams chunks over WS (with_streaming_response).
- Frontend (useConversation.ts:119-150): pushes to audioBufferRef, only combines + plays the Blob on speaking_end.
- User hears nothing until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
- Previous "33s delay" complaints came partly from this, plus Honcho calls and whole-blob STT.
No live transcript UI (transcript_delta ignored)
- Backend sends transcript_delta messages word by word as the user speaks (VAD runs on the server).
- Frontend handleMessage: case 'transcript_delta': // Streaming partial... break;
- ChatBubble only shows final transcript messages. User has zero feedback on what the mic is hearing in real time.
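
For the first item above, the event routing the backend needs can be sketched as a pure function. This is a minimal sketch, not the code in whisper_stream.py: the event-type suffixes follow the names used in this audit, while the function name route_stt_event, the payload fields ("delta", "transcript", "error.message") and the "text" / "message" keys sent to the frontend are assumptions to verify against the Realtime API and the existing frontend protocol.

```python
from typing import Any, Optional


def route_stt_event(event: dict[str, Any]) -> Optional[dict[str, str]]:
    """Map one upstream Realtime event to a frontend message, or None to ignore it."""
    etype = event.get("type", "")

    # Partial transcript: forward it so the UI can show live text.
    if etype.endswith("input_audio_transcription.delta"):
        text = event.get("delta", "")  # assumed field name
        return {"type": "transcript_delta", "text": text} if text else None

    # Final transcript for one utterance.
    if etype.endswith("input_audio_transcription.completed"):
        text = event.get("transcript", "").strip()  # assumed field name
        return {"type": "transcript", "text": text}

    # Upstream failures (e.g. model_not_found) should reach the user, not only the log.
    if etype == "error":
        err = event.get("error") or {}
        return {"type": "error", "message": err.get("message", "Transcription failed")}

    return None  # session.created and other events are not forwarded
```

Keeping the mapping free of I/O means it can be unit-tested with recorded events before touching the live WebSocket.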

Missing / Incomplete Features (from original spec)

Notes widget not in the UI
- Notes.tsx fully implemented (add/toggle/delete local notes).
- App.tsx:6 imports it but never renders it in the grid.
- BackgroundScene also imported but unused (Toolbar is a limited replacement).
No white noise / ambient sounds
- Spec explicitly called for "white noise".
- Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.
"Focus tools" beyond basic timer are thin
- Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
- No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.
Pet is purely decorative
- PetZone.tsx: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.
Accessory system incomplete for Live2D
- Wardrobe sends accessory changes.
- KiraAvatar.tsx:196: // The model has hair clips... For now, the accessory system works on the fallback avatar
- Only AnimatedAvatar (SVG) respects accessories.

Dead Code / Stale Architecture

Legacy service files completely unused
- backend/services/stt.py, tts.py, llm.py exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
- Current main.py has everything inlined + uses whisper_stream.py.
- Confusing for future devs; requirements.txt still pulls old deps indirectly.
- Docs still describe the old stack.
Documentation is badly out of date
- README.md: "Whisper STT → DeepSeek → OpenAI Nova"
- ARCHITECTURE.md: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
- PROJECT_SCOPE.md: planning questions, not current reality.
- No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.
Stale variables and imports
- backend/main.py: import time (unused), pending_transcript, transcript_lock (declared, assigned in on_done but never read — leftover from aborted refactor).
- frontend/src/hooks/useConversation.ts: recorderRef declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
- App.tsx: unused imports for BackgroundScene and Notes.

UX / Reliability / Console Problems

Deprecated audio capture
- startPCMCapture uses ScriptProcessorNode (deprecated since 2019, console warning on every mic use).
- Should be AudioWorkletNode for production.
YouTube music player is flaky and spammy
- Repeated console errors:
  - Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...
  - Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS
  - net::ERR_FAILED 503
- Music often requires user gesture; ads/CSP can break playback or fill console.
Timer stopwatch mode is broken
- Timer.tsx:115: insane display formula formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))
- Interval always does r - 1; starting at 0 for stopwatch immediately hits <=1 and stops.
- No real count-up logic.
WelcomeScreen rendering bug
- When saved user-id exists but not yet identified: renders a full min-h-screen WelcomeScreen inside a small glass-card. CSS collision, bad layout.
Poor error visibility for mic/STT failures
- Mic errors set micError state, but it appears to be surfaced in only one place; most failures reach only the console.
- STT connection failures produce no user-facing message — just silent "listening" that does nothing.
- handleMessage for 'error' only does console.error.
Live2D vs fallback inconsistency
- Attempts to load kira.model3.json (which proxies Epsilon model + custom textures).
- Outfit swap code runs but is fragile ((model as any).textures[2] = ...).
- Many users will see the SVG fallback (or broken Live2D).
- Cubism deprecation warnings on load.
No real interruption / barge-in
- User can keep talking while Kira speaks; no client "stop" signal.
- No handling of interruption type on backend.
Continuous streaming while "listening"
- Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
- Server VAD fires on_done on silences → Kira can interject multiple times unprompted.
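
For the barge-in and repeated-interjection items above, the backend half could look like the following sketch. It assumes each reply (LLM call plus TTS streaming) runs as one asyncio task per turn. The SpeechTurn class and the idea of a client "interrupt" message are hypothetical and do not exist in the current main.py.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


class SpeechTurn:
    """Tracks the in-flight reply so a newer event can cancel it."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    def start(self, reply: Callable[[], Coroutine[Any, Any, None]]) -> asyncio.Task:
        # A new reply replaces any reply that is still being spoken,
        # so two on_done callbacks cannot talk over each other.
        self.cancel()
        self._task = asyncio.create_task(reply())
        return self._task

    def cancel(self) -> bool:
        """Cancel the current reply; return True if one was actually running."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            return True
        return False
```

The WebSocket handler would call cancel() when the client sends its stop signal, and start() from on_done. The reply coroutine should let asyncio.CancelledError propagate so the TTS stream is closed promptly.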

Performance / Cost / Infra

LLM → TTS not overlapped
- In on_done and text path: full await chat.completions before starting TTS stream.
- Even with fast nano, adds 1-2s serial delay before first audio chunk.
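
One way to overlap the two stages is to stream the LLM output and hand each sentence to TTS as soon as it is complete, instead of awaiting the full completion. The sketch below shows only the sentence-grouping step. It assumes the LLM client can yield text deltas as an async iterator; the sentences helper is hypothetical, and its boundary rule (terminal punctuation followed by whitespace) is deliberately naive.

```python
import re
from typing import AsyncIterator

# A sentence ends at ., ! or ? followed by whitespace (naive; abbreviations will split).
_BOUNDARY = re.compile(r"[.!?]+\s")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed text deltas into sentences as soon as each one completes."""
    buf = ""
    async for tok in tokens:
        buf += tok
        # Emit every complete sentence currently sitting in the buffer.
        while (m := _BOUNDARY.search(buf)) is not None:
            sentence, buf = buf[: m.end()].strip(), buf[m.end():]
            yield sentence
    if buf.strip():
        yield buf.strip()  # trailing text without a detected boundary
```

With this in place, on_done could start the TTS stream for the first sentence while later tokens are still arriving, so time to first audio depends on the first sentence rather than the whole reply.
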
Honcho calls are expensive/slow
- build_system_prompt_suffix (called once at identify) does two peer.chat(...) which are LLM calls inside Honcho.
- Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.
Model access / pricing mismatch
- Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
- Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.
No health / readiness for the whisper sub-connection
- Main /api/health says "ok" even when whisper stream is dead.
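
For the health item above, the response body could reflect the state of the STT link instead of a constant "ok". A minimal sketch, assuming the whisper stream code records the outcome of its most recent connection attempt in a small state object; SttLinkState, health_payload and the "degraded" status value are hypothetical names, not existing code.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SttLinkState:
    """Last-known state of the upstream STT WebSocket (hypothetical tracker)."""
    connected: bool = False
    last_error: Optional[str] = None


def health_payload(stt: SttLinkState) -> dict[str, Any]:
    """Build a health body that reports 'degraded' when the STT link is down."""
    stt_ok = stt.connected and stt.last_error is None
    return {
        "status": "ok" if stt_ok else "degraded",
        "stt": {"connected": stt.connected, "last_error": stt.last_error},
    }
```

The /api/health route would return this dictionary, and the connection code would set last_error when the upstream closes with an error such as model_not_found.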

Deployment / Dev Experience

Hardcoded model strings everywhere
- Changing STT/LLM/TTS requires edits in multiple places + no central config.
No .env validation or example for Realtime keys
- .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
Build produces large bundles
- Live2D cubism + pixi ~300k gzipped; no tree-shaking notes.
Local dev vs prod drift
- Vite proxy assumes backend hostname (docker).
- No easy npm run dev + separate backend story documented.
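
The first two items above (scattered model strings, no .env validation) could be addressed by one settings loader that fails fast at startup. This is a sketch under assumptions: the environment variable names are hypothetical, and the defaults are simply the model and voice names mentioned elsewhere in this audit, not verified identifiers.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    honcho_api_key: str
    stt_model: str
    llm_model: str
    tts_voice: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read all model choices from one place and reject a .env with missing keys."""
    required = ("OPENAI_API_KEY", "HONCHO_API_KEY")  # assumed variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        honcho_api_key=env["HONCHO_API_KEY"],
        stt_model=env.get("STT_MODEL", "gpt-4o-transcribe"),  # assumed default
        llm_model=env.get("LLM_MODEL", "gpt-5.4-nano"),       # assumed default
        tts_voice=env.get("TTS_VOICE", "sage"),               # assumed default
    )
```

Calling load_settings() once in main.py and passing the result to the STT, LLM and TTS code would make a model swap a one-line .env change.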

Quick Wins (Prioritized)

1. Fix Realtime model name + test the WS transcription session (or fall back to REST gpt-4o-transcribe + streaming deltas via another mechanism).
2. Make TTS playback incremental (use AudioContext buffer append, successive audio elements, or MediaSource).
3. Wire transcript_delta into ChatBubble as a "Kira is hearing..." live line or user speech preview.
4. Delete or archive legacy stt/tts/llm.py and update all docs.
5. Render <Notes /> in the grid (replace one column or add a 5th).
6. Add a simple white noise player (HTMLAudio with free CC0 tracks or oscillator).
7. Replace ScriptProcessor with AudioWorklet (or at least suppress the deprecation for now).
8. Fix stopwatch timer logic properly (separate elapsed state).
9. Clean unused imports + recorderRef.
10. Add basic user-facing error banner for mic/STT failures.

Recommendations

Short term: Revert voice pipeline to the previous working (cheaper) REST version (gpt-4o-transcribe + nano + streaming TTS) until the Realtime transcription model name + access is sorted. The "pivot to Realtime WS" introduced more problems than it solved for latency.
Medium term: Implement proper client VAD or use OpenAI's server VAD more cleanly; consider always-listening with push-to-interrupt.
Long term: Either get proper Realtime API access or switch to a different low-latency STT (e.g. browser Web Speech API for free tier, or Deepgram/AssemblyAI).
Consider making the avatar always visible + voice always available rather than explicit "Talk" toggle for true body-double feel.

This audit is based on code + live server state as of the last successful deploy. Re-run after fixes.

Next steps: I can implement the quick wins (start with 1, 2, 3, 4) if you say the word. The old preserved todo list is now stale and should be replaced.
