fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed

- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
2026-06-04 15:14:26 -04:00
parent 7502f201c7
commit 191b7ad9b5
4 changed files with 367 additions and 12 deletions
+180
@@ -0,0 +1,180 @@
# Kira Project Audit — Broken / Wrong / Doesn't Work
**Date:** 2026-06-04 (current session state)
**Scope:** Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
**Method:** File inspection, server logs, build verification, component cross-refs, console history from user reports.
---
## Critical (Core Voice Pipeline Broken)
1. **Realtime STT via gpt-realtime-whisper is non-functional**
- `backend/services/whisper_stream.py:44`: WS URL hardcodes `?model=gpt-4o-mini-realtime-preview`
- Server logs repeatedly: `invalid_request_error.model_not_found` + close code 4004
- ``Whisper error: The model `gpt-4o-mini-realtime-preview` does not exist or you do not have access to it.``
- Result: Pressing "Talk to Kira" streams PCM16 but backend never processes → no transcript, no response. Equivalent to the old "Could not transcribe".
- Root cause: the Realtime endpoint accepts only specific model names (e.g. `gpt-4o-realtime-preview`). Either the account lacks preview access, or passing `gpt-realtime-whisper` as the transcription model is not the supported way to do pure STT.
- The session.update tries `"model": "gpt-realtime-whisper"` for input_audio_transcription — this may also be invalid.
2. **TTS audio playback is batched, not streaming to user**
- Backend streams chunks over WS (`with_streaming_response`).
- Frontend (`useConversation.ts:119-150`): pushes to `audioBufferRef`, **only combines + plays the Blob on `speaking_end`**.
- User hears **nothing** until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
- Previous "33s delay" complaints were partly this + Honcho + full STT blob.
3. **No live transcript UI (transcript_delta ignored)**
- Backend sends `transcript_delta` word by word as the user speaks (VAD on server).
- Frontend `handleMessage`: `case 'transcript_delta': // Streaming partial... break;`
- ChatBubble only shows final `transcript` messages. User has zero feedback on what the mic is hearing in real time.
---
## Missing / Incomplete Features (from original spec)
4. **Notes widget not in the UI**
- `Notes.tsx` fully implemented (add/toggle/delete local notes).
- `App.tsx:6` imports it but **never renders** it in the grid.
- BackgroundScene also imported but unused (Toolbar is a limited replacement).
5. **No white noise / ambient sounds**
- Spec explicitly called for "white noise".
- Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.
6. **"Focus tools" beyond basic timer are thin**
- Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
- No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.
7. **Pet is purely decorative**
- `PetZone.tsx`: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.
8. **Accessory system incomplete for Live2D**
- Wardrobe sends accessory changes.
- `KiraAvatar.tsx:196`: `// The model has hair clips... For now, the accessory system works on the fallback avatar`
- Only AnimatedAvatar (SVG) respects accessories.
---
## Dead Code / Stale Architecture
9. **Legacy service files completely unused**
- `backend/services/stt.py`, `tts.py`, `llm.py` exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
- Current `main.py` has everything inlined + uses `whisper_stream.py`.
- Confusing for future devs; requirements.txt still pulls old deps indirectly.
- Docs still describe the old stack.
10. **Documentation is badly out of date**
- `README.md`: "Whisper STT → DeepSeek → OpenAI Nova"
- `ARCHITECTURE.md`: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
- `PROJECT_SCOPE.md`: planning questions, not current reality.
- No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.
11. **Stale variables and imports**
- `backend/main.py`: `import time` (unused), `pending_transcript`, `transcript_lock` (declared, assigned in on_done but never read — leftover from aborted refactor).
- `frontend/src/hooks/useConversation.ts`: `recorderRef` declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
- `App.tsx`: unused imports for `BackgroundScene` and `Notes`.
---
## UX / Reliability / Console Problems
12. **Deprecated audio capture**
- `startPCMCapture` uses `ScriptProcessorNode`, which is deprecated in favor of AudioWorklet and logs a console warning on every mic use.
- Should be `AudioWorkletNode` for production.
13. **YouTube music player is flaky and spammy**
- Repeated console errors:
- `Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...`
- `Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS`
- `net::ERR_FAILED 503`
- Music often requires user gesture; ads/CSP can break playback or fill console.
14. **Timer stopwatch mode is broken**
- `Timer.tsx:115`: insane display formula `formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))`
- Interval always does `r - 1`; starting at 0 for stopwatch immediately hits <=1 and stops.
- No real count-up logic.
15. **WelcomeScreen rendering bug**
- When saved user-id exists but not yet identified: renders a full `min-h-screen` WelcomeScreen **inside** a small glass-card. CSS collision, bad layout.
16. **Poor error visibility for mic/STT failures**
- Mic errors set `micError` state, but it appears to be surfaced in only one place; most failures reach only the console.
- STT connection failures produce no user-facing message — just silent "listening" that does nothing.
- `handleMessage` for 'error' only does `console.error`.
17. **Live2D vs fallback inconsistency**
- Attempts to load `kira.model3.json` (which proxies Epsilon model + custom textures).
- Outfit swap code runs but is fragile (`(model as any).textures[2] = ...`).
- Many users will see the SVG fallback (or broken Live2D).
- Cubism deprecation warnings on load.
18. **No real interruption / barge-in**
- User can keep talking while Kira speaks; no client "stop" signal.
- No handling of `interruption` type on backend.
19. **Continuous streaming while "listening"**
- Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
- Server VAD fires `on_done` on silences → Kira can interject multiple times unprompted.
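
Item 18 (barge-in) could be handled on the backend by tracking the in-flight TTS task and cancelling it when the client sends a stop signal. The following is a minimal sketch, not the project's code: `SpeechSession`, `start_speaking`, and `interrupt` are hypothetical names, and the `asyncio.sleep` call stands in for awaiting real TTS chunks. The `demo` coroutine starts five chunks, interrupts partway through, and reports whether a cancellation happened and how many chunks went out.

```python
import asyncio
from typing import List, Optional, Tuple


class SpeechSession:
    """Tracks the in-flight TTS task so a client message can cancel it."""

    def __init__(self) -> None:
        self._tts_task: Optional[asyncio.Task] = None
        self.sent: List[str] = []  # chunks that were actually delivered

    async def _speak(self, chunks: List[str]) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # stand-in for awaiting the next TTS chunk
            self.sent.append(chunk)

    def start_speaking(self, chunks: List[str]) -> None:
        self.interrupt()  # never let two responses overlap
        self._tts_task = asyncio.create_task(self._speak(chunks))

    def interrupt(self) -> bool:
        """Cancel current speech; returns True if something was cancelled."""
        task = self._tts_task
        if task is not None and not task.done():
            task.cancel()
            return True
        return False

    async def wait(self) -> None:
        """Wait for the current speech task to finish or be cancelled."""
        if self._tts_task is not None:
            try:
                await self._tts_task
            except asyncio.CancelledError:
                pass


async def demo() -> Tuple[bool, int]:
    session = SpeechSession()
    session.start_speaking(["a", "b", "c", "d", "e"])
    await asyncio.sleep(0.025)  # let a couple of chunks go out
    cancelled = session.interrupt()  # e.g. on an "interruption" message
    await session.wait()
    return cancelled, len(session.sent)
```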
---
## Performance / Cost / Infra
20. **LLM → TTS not overlapped**
- In `on_done` and text path: full `await chat.completions` before starting TTS stream.
- Even with fast nano, adds 1-2s serial delay before first audio chunk.
21. **Honcho calls are expensive/slow**
- `build_system_prompt_suffix` (called once at identify) does two `peer.chat(...)` which are LLM calls inside Honcho.
- Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.
22. **Model access / pricing mismatch**
- Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
- Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.
23. **No health / readiness for the whisper sub-connection**
- Main `/api/health` says "ok" even when whisper stream is dead.
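
Item 20 could be addressed by streaming the LLM output and handing each finished sentence to TTS while the rest is still generating. The sketch below assumes a streaming chat completion that yields text deltas; `fake_llm` and `tts_worker` are hypothetical stand-ins, and the sentence splitter is deliberately naive (it would also split on decimals such as "1.5").

```python
import asyncio
from typing import AsyncIterator, List

SENTENCE_END = (".", "!", "?")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed LLM tokens into sentences as soon as each one ends."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_END):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing text without punctuation
        yield buf.strip()


async def fake_llm() -> AsyncIterator[str]:
    # Stand-in for a streaming chat completion; yields text deltas.
    for tok in ["Hi", " there", ".", " Ready", " to", " focus", "?"]:
        await asyncio.sleep(0)
        yield tok


async def pipeline() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    spoken: List[str] = []

    async def tts_worker() -> None:
        # Stand-in for starting one TTS stream per sentence.
        while (text := await queue.get()) is not None:
            spoken.append(text)

    worker = asyncio.create_task(tts_worker())
    async for sentence in sentences(fake_llm()):
        await queue.put(sentence)  # TTS can start while the LLM keeps going
    await queue.put(None)  # sentinel: no more text
    await worker
    return spoken
```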
---
## Deployment / Dev Experience
24. **Hardcoded model strings everywhere**
- Changing STT/LLM/TTS requires edits in multiple places + no central config.
25. **No .env validation or example for Realtime keys**
- .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
26. **Build produces large bundles**
- Live2D cubism + pixi ~300k gzipped; no tree-shaking notes.
27. **Local dev vs prod drift**
- Vite proxy assumes `backend` hostname (docker).
- No easy `npm run dev` + separate backend story documented.
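
Items 24 and 25 could share one fix: a single config object built from the environment and validated at startup. This is a sketch under assumptions: the variable names (`OPENAI_API_KEY`, `HONCHO_API_KEY`, `STT_MODEL`, `LLM_MODEL`, `TTS_VOICE`) and the `VoiceConfig` class are illustrative, not taken from the repository. The defaults are the model and voice names mentioned in this audit.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoiceConfig:
    openai_api_key: str
    honcho_api_key: str
    stt_model: str
    llm_model: str
    tts_voice: str


# Assumed variable names; adjust to whatever .env.example actually defines.
REQUIRED = ("OPENAI_API_KEY", "HONCHO_API_KEY")


def load_config(env: Mapping[str, str] = os.environ) -> VoiceConfig:
    """Fail fast on missing keys and keep model names in one place."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return VoiceConfig(
        openai_api_key=env["OPENAI_API_KEY"],
        honcho_api_key=env["HONCHO_API_KEY"],
        stt_model=env.get("STT_MODEL", "gpt-realtime-whisper"),
        llm_model=env.get("LLM_MODEL", "gpt-5.4-nano"),
        tts_voice=env.get("TTS_VOICE", "sage"),
    )
```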
---
## Quick Wins (Prioritized)
1. Fix Realtime model name + test the WS transcription session (or fall back to REST `gpt-4o-transcribe` + streaming deltas via another mechanism).
2. Make TTS playback incremental (use AudioContext buffer append or successive `<audio>` elements / MediaSource).
3. Wire `transcript_delta` into ChatBubble as a "Kira is hearing..." live line or user speech preview.
4. Delete or archive legacy stt/tts/llm.py and update all docs.
5. Render `<Notes />` in the grid (replace one column or add a 5th).
6. Add a simple white noise player (HTMLAudio with free CC0 tracks or oscillator).
7. Replace ScriptProcessor with AudioWorklet (or at least suppress the deprecation for now).
8. Fix stopwatch timer logic properly (separate elapsed state).
9. Clean unused imports + recorderRef.
10. Add basic user-facing error banner for mic/STT failures.
---
## Recommendations
- **Short term**: Revert voice pipeline to the previous working (cheaper) REST version (`gpt-4o-transcribe` + nano + streaming TTS) until the Realtime transcription model name + access is sorted. The "pivot to Realtime WS" introduced more problems than it solved for latency.
- **Medium**: Implement proper client VAD or use OpenAI's server VAD more cleanly; consider always-listening with push-to-interrupt.
- **Long**: Either get proper Realtime API access or switch to a different low-latency STT (e.g. browser Web Speech API for free tier, or Deepgram/AssemblyAI).
- Consider making the avatar always visible + voice always available rather than explicit "Talk" toggle for true body-double feel.
This audit is based on code + live server state as of the last successful deploy. Re-run after fixes.
Next steps: I can implement the quick wins (start with 1, 2, 3, 4) if you say the word. The old preserved todo list is now stale and should be replaced.
+157
@@ -0,0 +1,157 @@
# Kira Body Double — Systematic Fix Plan (from AUDIT.md)
**Created:** 2026-06-04
**Source:** Full audit of codebase, running deployment (172.20.1.204), server logs, build outputs, component cross-references, and user-reported symptoms (33s+ delays, static, "Could not transcribe", console errors).
**Goal:** Turn the current non-functional voice pipeline + incomplete UI into a reliable, low-latency body-double experience matching the original specs (realtime voice, lo-fi + white noise, ADHD tools, avatar customization, notes, clock, pet, backgrounds).
**Approach:** Prioritize by dependency and impact. Fix core voice first (STT + audio playback + feedback). Then UI completeness. Then cleanup + polish. Deploy + verify after every change where possible. Re-test end-to-end on the live site at the end.
**Principles:**
- Prefer minimal, targeted patches over big refactors.
- Keep the cheap 3-step pipeline (or Realtime where it actually helps latency).
- Verify with real tool output (builds, docker logs, git, SSH health checks).
- Update docs and todo as we go.
- If a subtask blocks, note it and move to parallelizable items.
**Total items:** 9 (matching the current todo list). Dependencies noted.
---
## 1. fix-realtime-stt (CRITICAL — blocks everything voice-related)
**Status in audit:** Mic does nothing. WS connects then 4004s with `model_not_found`.
**Root cause:** Wrong base model in WS URL (`gpt-4o-mini-realtime-preview`). `gpt-realtime-whisper` exists per OpenAI announcements but must be used as the `?model=` in the Realtime WS connect for transcription mode. Session config may also need tweaking for pure STT (no LLM responses from the realtime model).
**Options:**
A. Correct model + config for Realtime WS (preferred for true streaming partials + VAD).
B. Revert to proven REST path (`gpt-4o-transcribe`) + simulate deltas if possible (cheaper, known working from history).
**Plan steps:**
1. Read current `backend/services/whisper_stream.py` and `backend/main.py` WS handling.
2. Update WS URL to `?model=gpt-realtime-whisper` (and adjust session.update if needed — remove or set transcription model correctly).
3. Add better error logging + client-facing error on WS failure.
4. Rebuild/deploy.
5. Verify: SSH docker logs for "Whisper stream ready" + no 4004/model_not_found. Simulate or note user test.
6. If still fails after 1-2 attempts: implement Option B (switch main.py back to REST transcription + keep the streaming delta plumbing for future).
**Risks:** OpenAI access tier for Realtime Whisper. Cost (Realtime is more expensive than REST transcription).
**Success criteria:** Backend accepts audio chunks and emits `transcript_delta` + `on_done` / `transcript` without errors. No more "Could not transcribe".
**Dependencies:** None (first).
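
Step 2 can be kept testable by building the URL and the `session.update` payload in pure functions, separate from the socket code. The sketch below mirrors the fields visible in this commit's diff. The helper names are hypothetical, and whether the API accepts exactly this session shape for `gpt-realtime-whisper` is an assumption to verify against the server's response.

```python
from urllib.parse import urlencode

REALTIME_BASE = "wss://api.openai.com/v1/realtime"


def realtime_url(model: str) -> str:
    """Build the Realtime WS URL with the model as a query parameter."""
    return f"{REALTIME_BASE}?{urlencode({'model': model})}"


def transcription_session_update(model: str) -> dict:
    """session.update payload for audio-in, transcript-out use.

    Field names are copied from the diff in this commit; the accepted
    shape for a given model is not confirmed here.
    """
    return {
        "type": "session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {"model": model},
        },
    }
```

With the model passed in as an argument, switching between Option A and a different Realtime model becomes a one-line change at the call site.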
## 2. streaming-audio-playback (HIGH — fixes perceived latency)
**Status:** Chunks stream from backend but frontend only plays on `speaking_end`.
**Plan steps:**
1. Read `frontend/src/hooks/useConversation.ts` audio handling + `backend/main.py` synthesize_speech.
2. Switch frontend to play chunks incrementally:
- Option: Use a single Audio element whose `src` is a `MediaSource` object URL, appending Opus chunks to a `SourceBuffer` (complex but true streaming).
- Simpler reliable: Create new `<audio>` elements or use Web Audio API to decode/play each small chunk as it arrives (low latency for first words).
- Keep lip-sync tied to `isKiraSpeaking`.
3. Update backend to send smaller chunks or keep current streaming.
4. Handle `speaking_end` only for final cleanup / state.
5. Test build + deploy.
6. Verify with logs + manual timing (first audio chunk arrives ~1-2s after LLM).
**Success criteria:** User hears Kira's voice within ~1-2s of the user finishing speaking (first words audible while the rest is still generating).
**Dependencies:** 1 (need working STT to test TTS flow).
## 3. live-transcript-ui (MEDIUM — UX for body-double feel)
**Status:** Deltas arrive but are dropped.
**Plan steps:**
1. In `useConversation.ts`: expose live partial transcript state (append deltas, clear on `transcript` or `speaking_start`).
2. Pass the live partial to `ChatBubble.tsx` (or a new "Hearing..." bubble).
3. In ChatBubble: show a temporary "user is speaking: [live text]" or "Kira hears: ..." line with typing indicator while listening.
4. Clear it when final transcript is processed or on error.
5. Style girly-pop (pink bubble or inline).
6. Build/deploy/verify.
**Success criteria:** While Talk is active (or during server VAD), the user sees words appear in real time as they speak.
**Dependencies:** 1 (deltas only come from working STT).
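
The append-and-clear rule in step 1 does not depend on the UI framework. It is sketched here in Python to match the backend code in this commit; the real state would live in the `useConversation.ts` hook. `LiveTranscript` is a hypothetical name, and the `"text"` key on the message is an assumption about the payload shape.

```python
from dataclasses import dataclass


@dataclass
class LiveTranscript:
    """Holds the partial transcript shown while the user is speaking."""

    partial: str = ""

    def apply(self, message: dict) -> str:
        kind = message.get("type")
        if kind == "transcript_delta":
            # Assumes deltas are incremental, so they are appended.
            self.partial += message.get("text", "")
        elif kind in ("transcript", "speaking_start", "error"):
            # Final transcript, Kira's reply, or a failure ends the preview.
            self.partial = ""
        return self.partial
```

If the server turns out to send cumulative text instead of incremental deltas, the append becomes a plain assignment.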
## 4. integrate-notes (LOW — quick completeness win)
**Status:** Component exists and is local-state only. Dead import in App.
**Plan steps:**
1. In `App.tsx`: import is already there — add `<Notes />` to the grid (e.g., replace or add alongside Timer or in a new row/column for balance).
2. Consider making notes persistent (localStorage) so they survive refresh (optional quick win).
3. Test layout on 1920x1080.
4. Build/deploy.
**Success criteria:** Notes panel visible and functional in the main dashboard.
**Dependencies:** None.
## 5. add-white-noise (MEDIUM — matches original spec)
**Status:** Spec: "white noise". Only YouTube lofi exists.
**Plan steps:**
1. Decide sources: free public-domain / CC0 ambient tracks (rain, cafe, fan, brown noise) hosted or use Web Audio oscillator + filters for simple generated noise (no external deps, reliable).
2. Extend MusicPlayer or create a new component `WhiteNoisePlayer.tsx` (similar UX to MusicPlayer: buttons for presets, volume, play/pause).
3. Add to grid in `App.tsx` (perhaps replace or share space with MusicPlayer, or add a small ambient section).
4. Integrate with scenes if possible (e.g., rain scene enables rain sound).
5. Use `<audio>` elements with loop for tracks (host small mp3/wav in public/sounds or generate).
6. For zero-asset version: use AudioContext + noise buffer (pink/brown noise generator).
7. Build + deploy.
**Success criteria:** User can toggle white noise independently of (or layered with) lofi. Matches "white noise" in original request.
**Dependencies:** 4 (grid space).
## 6. clean-legacy-dead-code (MEDIUM — hygiene)
**Status:** Old services + docs describe pre-Honcho, pre-nano, pre-Sage, pre-streaming world.
**Plan steps:**
1. Delete (or move to `archive/`) `backend/services/stt.py`, `tts.py`, `llm.py`.
2. Update `backend/requirements.txt` if any unique deps were only for them.
3. Rewrite sections in:
- `README.md` (features, quick start, pipeline, costs).
- `ARCHITECTURE.md` (data flow, stack table, current Realtime attempt).
4. Update `PLAN.md` and `AUDIT.md` references if needed.
5. Remove the unused `time` import and the stale `pending_transcript` / `transcript_lock` variables in `main.py`.
6. Remove dead `recorderRef` in frontend hook.
7. Commit + deploy (frontend/backend rebuild not strictly needed but do it).
**Success criteria:** No more references to the old Whisper+DeepSeek+Nova pipeline. Codebase matches reality.
**Dependencies:** Can be done in parallel with others.
## 7. fix-deprecations (MEDIUM — polish + warnings)
**Subtasks (can batch):**
a. ScriptProcessorNode → AudioWorkletNode (or at minimum wrap the deprecation in a try and log once).
b. YouTube music: suppress or fix postMessage/CORS (common YouTube IFrame issues; options: host a silent proxy, use a different lofi source like a free stream, or add `origin` handling + user gesture guards. For now, add try/catch around YT init and a note in UI).
c. Timer stopwatch: replace the hacky formula + logic with proper elapsed counter (use a separate `elapsed` state that only increments when mode==='stopwatch').
d. Any other console warnings from build or previous user reports (Pixi interaction deprecation is already noted in history; check current build).
**Plan steps:**
1. Fix one subtask at a time in frontend (Timer.tsx, useConversation.ts, MusicPlayer.tsx).
2. Rebuild frontend after changes.
3. Verify no new errors in `npm run build`.
**Success criteria:** Clean console on load + mic use. Stopwatch actually counts up.
**Dependencies:** None (frontend-only).
## 8. fix-welcome-nesting (LOW)
**Status:** When saved ID exists, WelcomeScreen (full min-h-screen) is rendered inside a card div.
**Plan steps:**
1. In `App.tsx`: adjust the conditional rendering for WelcomeScreen — either make it always full-bleed (portal or separate route-like) or simplify the "identify" flow so it doesn't nest.
2. Test both new-user and returning-user paths.
3. Build/deploy.
**Success criteria:** Welcome screen looks correct (full or nicely carded) for both paths. No layout breakage.
**Dependencies:** None.
## 9. audit-followup (FINAL)
**Plan steps:**
1. After all above: full rebuild + deploy to 172.20.1.204.
2. SSH: `docker compose logs --tail 20`, health check, git status on server.
3. Update `AUDIT.md` with "Fixed" sections + remaining known issues.
4. Update this PLAN.md with completion notes.
5. Update todo list to all completed.
6. Provide user with summary + "hard refresh and test mic" instructions.
7. Optional: add a simple test script or note for future.
**Success criteria:** Live site has working (or clearly documented fallback) voice + all spec items visible. No critical items from original audit left open.
**Dependencies:** All previous.
---
## Execution Order & Parallelism
- **Phase 1 (Voice Core — do sequentially):** 1 → 2 → 3
- **Phase 2 (UI Completeness):** 4 → 5 (5 needs the grid space settled in 4)
- **Phase 3 (Cleanup & Polish — parallelizable):** 6, 7, 8
- **Phase 4:** 9 (only after all)
**Estimated effort:** 1-2 hours of focused tool use + deploys (real time longer due to builds/SSH).
**Verification at each step:**
- `cd frontend && npm run build`
- `cd backend && python -c "import main; from services... import ..."`
- `git add -A && git commit -m "..." && git push`
- `ssh root@172.20.1.204 "cd /opt/kira && git pull && docker compose up -d --build 2>&1 | tail -5"`
- `ssh ... "docker logs kira-backend --tail 15 | grep -E 'whisper|STT|error|ready'"`
- Health: `curl http://localhost:8000/api/health` on server
- For frontend changes: note that nginx serves the new dist after rebuild.
**Rollback:** Every commit is atomic. Server has restart policy. Keep a "before" tag if needed.
**Post-plan:** After 9, offer to save this workflow as a skill if it proves reusable.
Start execution now. First action: tackle item 1 (fix-realtime-stt) by correcting the model name.
+4
@@ -141,6 +141,10 @@ async def conversation_ws(websocket: WebSocket):
     async def on_error(msg: str):
         logger.warning(f"Whisper error: {msg}")
+        try:
+            await websocket.send_json({"type": "error", "message": f"Transcription error: {msg}"})
+        except Exception:
+            pass
     # Start WhisperStream
     stream = WhisperStream(
+26 -12
@@ -41,7 +41,7 @@ class WhisperStream:
         try:
             import websockets
-            url = "wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview"
+            url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"
             ws = await websockets.connect(
                 url,
                 additional_headers={
@@ -58,7 +58,6 @@ class WhisperStream:
             await self._send({
                 "type": "session.update",
                 "session": {
-                    "modalities": ["text"],  # no audio output
                     "input_audio_format": "pcm16",
                     "input_audio_transcription": {
                         "model": "gpt-realtime-whisper",
@@ -98,26 +97,41 @@ class WhisperStream:
         if et == "input_audio_buffer.speech_started":
             self._transcript = ""
             logger.debug("speech_started")
+        elif et in ("conversation.item.input_audio_transcription.delta", "input_audio_buffer.transcription.delta"):
+            # Partial streaming transcript
+            delta = data.get("delta", "") or data.get("transcript", "")
+            if delta:
+                self._transcript = delta  # or append if cumulative
+                await self._on_delta(delta)
+        elif et in ("conversation.item.input_audio_transcription.completed", "input_audio_buffer.transcription.completed", "conversation.item.created"):
+            # Final or item created with transcript
+            item = data.get("item", {})
+            content = item.get("content", []) if item else []
+            transcript = data.get("transcript", "")
+            if not transcript:
+                for part in (content or []):
+                    if part.get("type") in ("transcript", "text"):
+                        transcript = part.get("transcript", "") or part.get("text", "")
+                        break
+            if transcript:
+                self._transcript = transcript
+                await self._on_delta(transcript)
+                await self._on_done(transcript.strip())
+                self._transcript = ""
         elif et == "input_audio_buffer.speech_stopped":
             if self._transcript.strip():
                 await self._on_done(self._transcript.strip())
                 self._transcript = ""
-        elif et == "conversation.item.created":
-            item = data.get("item", {})
-            content = item.get("content", [])
-            for part in (content or []):
-                pt = part.get("type", "")
-                txt = part.get("transcript", "") or part.get("text", "")
-                if pt == "transcript" and txt:
-                    self._transcript = txt
-                    await self._on_delta(txt)
         elif et == "error":
             err = data.get("error", {})
             msg = err.get("message", str(data))
             logger.warning(f"Whisper error: {msg}")
             await self._on_error(msg)
     async def send_audio(self, pcm16_bytes: bytes):
         if not self._connected: