fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed
- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
@@ -0,0 +1,180 @@
# Kira Project Audit: Broken / Wrong / Doesn't Work

**Date:** 2026-06-04 (current session state)
**Scope:** Full code + running deployment review (~/Projects/ai-body-double + docker on 172.20.1.204)
**Method:** File inspection, server logs, build verification, component cross-refs, console history from user reports.

---

## Critical (Core Voice Pipeline Broken)

1. **Realtime STT via gpt-realtime-whisper is non-functional**
   - `backend/services/whisper_stream.py:44`: WS URL hardcodes `?model=gpt-4o-mini-realtime-preview`
   - Server logs repeatedly: `invalid_request_error.model_not_found` + close code 4004
   - ``Whisper error: The model `gpt-4o-mini-realtime-preview` does not exist or you do not have access to it.``
   - Result: Pressing "Talk to Kira" streams PCM16 but the backend never processes it → no transcript, no response. Equivalent to the old "Could not transcribe".
   - Root cause: OpenAI Realtime models require specific names (e.g. `gpt-4o-realtime-preview`); the account may lack preview access, or passing "gpt-realtime-whisper" as a transcription parameter may not be the public way to do pure STT.
   - The session.update tries `"model": "gpt-realtime-whisper"` for input_audio_transcription; this may also be invalid.

2. **TTS audio playback is batched, not streaming to user**
   - Backend streams chunks over WS (`with_streaming_response`).
   - Frontend (`useConversation.ts:119-150`): pushes to `audioBufferRef`, **only combines + plays the Blob on `speaking_end`**.
   - User hears **nothing** until the entire TTS response is generated and transferred. Defeats the purpose of streaming TTS for low perceived latency.
   - Previous "33s delay" complaints were partly this + Honcho + full STT blob.

3. **No live transcript UI (transcript_delta ignored)**
   - Backend sends `transcript_delta` for word-by-word as she speaks (VAD on server).
   - Frontend `handleMessage`: `case 'transcript_delta': // Streaming partial... break;`
   - ChatBubble only shows final `transcript` messages. User has zero feedback on what the mic is hearing in real time.

---

## Missing / Incomplete Features (from original spec)

4. **Notes widget not in the UI**
   - `Notes.tsx` fully implemented (add/toggle/delete local notes).
   - `App.tsx:6` imports it but **never renders** it in the grid.
   - BackgroundScene also imported but unused (Toolbar is a limited replacement).

5. **No white noise / ambient sounds**
   - Spec explicitly called for "white noise".
   - Only lo-fi YouTube music (MusicPlayer). No rain, fan, cafe, brown noise, etc. toggles.

6. **"Focus tools" beyond basic timer are thin**
   - Timer exists (pomodoro/countdown/stopwatch) but stopwatch is broken (see below).
   - No task decomposition, "flocus" (focus?) mode, habit trackers, or integration with notes/chat.

7. **Pet is purely decorative**
   - `PetZone.tsx`: static SVG cats (Mochi + Luna). No feeding, petting, state, or interaction with focus sessions.

8. **Accessory system incomplete for Live2D**
   - Wardrobe sends accessory changes.
   - `KiraAvatar.tsx:196`: `// The model has hair clips... For now, the accessory system works on the fallback avatar`
   - Only AnimatedAvatar (SVG) respects accessories.

---

## Dead Code / Stale Architecture

9. **Legacy service files completely unused**
   - `backend/services/stt.py`, `tts.py`, `llm.py` exist and implement old REST pipeline (whisper-1 + DeepSeek + OpenAI TTS).
   - Current `main.py` has everything inlined + uses `whisper_stream.py`.
   - Confusing for future devs; requirements.txt still pulls old deps indirectly.
   - Docs still describe the old stack.

10. **Documentation is badly out of date**
    - `README.md`: "Whisper STT → DeepSeek → OpenAI Nova"
    - `ARCHITECTURE.md`: old data flow diagram with MediaRecorder + REST Whisper + DeepSeek.
    - `PROJECT_SCOPE.md`: planning questions, not current reality.
    - No mention of Honcho memory, Realtime WS, gpt-5.4-nano, Sage voice, streaming TTS.

11. **Stale variables and imports**
    - `backend/main.py`: `import time` (unused), `pending_transcript`, `transcript_lock` (declared, assigned in on_done but never read; leftover from aborted refactor).
    - `frontend/src/hooks/useConversation.ts`: `recorderRef` declared, cleaned up, but never assigned (MediaRecorder path was replaced by PCMCapture).
    - `App.tsx`: unused imports for `BackgroundScene` and `Notes`.

---

## UX / Reliability / Console Problems

12. **Deprecated audio capture**
    - `startPCMCapture` uses `ScriptProcessorNode` (deprecated in favor of AudioWorklet; console warning on every mic use).
    - Should be AudioWorkletNode for production.

13. **YouTube music player is flaky and spammy**
    - Repeated console errors:
      - `Failed to execute 'postMessage' on 'DOMWindow': The target origin provided ('https://www.youtube.com') does not match...`
      - `Access to XMLHttpRequest at 'https://googleads.g.doubleclick.net...' blocked by CORS`
      - `net::ERR_FAILED 503`
    - Music often requires user gesture; ads/CSP can break playback or fill console.

14. **Timer stopwatch mode is broken**
    - `Timer.tsx:115`: insane display formula `formatTime(Math.floor((presetMinutes * 60 - remaining + presetMinutes * 60) || remaining))`
    - Interval always does `r - 1`; starting at 0 for stopwatch immediately hits <=1 and stops.
    - No real count-up logic.

15. **WelcomeScreen rendering bug**
    - When saved user-id exists but not yet identified: renders a full `min-h-screen` WelcomeScreen **inside** a small glass-card. CSS collision, bad layout.

16. **Poor error visibility for mic/STT failures**
    - Mic errors set `micError` state, but it appears to be surfaced in only one place; most failures reach only the console.
    - STT connection failures produce no user-facing message, just a silent "listening" state that does nothing.
    - `handleMessage` for 'error' only does `console.error`.

17. **Live2D vs fallback inconsistency**
    - Attempts to load `kira.model3.json` (which proxies Epsilon model + custom textures).
    - Outfit swap code runs but is fragile (`(model as any).textures[2] = ...`).
    - Many users will see the SVG fallback (or broken Live2D).
    - Cubism deprecation warnings on load.

18. **No real interruption / barge-in**
    - User can keep talking while Kira speaks; no client "stop" signal.
    - No handling of `interruption` type on backend.

19. **Continuous streaming while "listening"**
    - Toggle starts permanent PCM16 stream to Realtime WS (cost + bandwidth).
    - Server VAD fires `on_done` on silences → Kira can interject multiple times unprompted.

---

## Performance / Cost / Infra

20. **LLM → TTS not overlapped**
    - In `on_done` and the text path: the full `await chat.completions` call finishes before the TTS stream starts.
    - Even with fast nano, adds 1-2s serial delay before first audio chunk.

21. **Honcho calls are expensive/slow**
    - `build_system_prompt_suffix` (called once at identify) makes two `peer.chat(...)` calls, each of which is an LLM call inside Honcho.
    - Previously caused 20-30s delays when done per-turn; now cached but still slow on first load and burns tokens.

22. **Model access / pricing mismatch**
    - Attempting Realtime WS for cheaper STT but the required preview models are either unavailable or more expensive than documented REST transcription.
    - Current fallback path (if it worked) would be gpt-4o-transcribe + nano + sage.

23. **No health / readiness for the whisper sub-connection**
    - Main `/api/health` says "ok" even when whisper stream is dead.
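
A minimal sketch of a health payload that reflects the STT sub-connection, assuming the backend keeps track of the most recent stream state. `StreamStatus` and `health_payload` are hypothetical names introduced here for illustration; the real `/api/health` handler may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamStatus:
    """Last known state of the STT stream (hypothetical helper type)."""
    connected: bool
    last_error: Optional[str] = None


def health_payload(stt: StreamStatus) -> dict:
    """Report "degraded" instead of "ok" when the STT stream is down."""
    payload = {
        "status": "ok" if stt.connected else "degraded",
        "stt_connected": stt.connected,
    }
    if stt.last_error:
        payload["stt_last_error"] = stt.last_error
    return payload


if __name__ == "__main__":
    print(health_payload(StreamStatus(connected=False, last_error="model_not_found")))
```

Returning "degraded" rather than failing outright keeps the container healthy for Docker while still making a dead STT stream visible in a `curl` check.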

---

## Deployment / Dev Experience

24. **Hardcoded model strings everywhere**
    - Changing STT/LLM/TTS requires edits in multiple places + no central config.
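
A central config could be as small as one frozen dataclass read from the environment. This is a sketch under stated assumptions: the environment variable names (`KIRA_STT_MODEL`, `KIRA_LLM_MODEL`, `KIRA_TTS_VOICE`) and the `ModelConfig` type are hypothetical, and the defaults are simply the model and voice names mentioned in this audit.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelConfig:
    """Single place for model names (hypothetical; not in the current codebase)."""
    stt_model: str
    llm_model: str
    tts_voice: str

    @property
    def realtime_url(self) -> str:
        # Same URL shape as the one hardcoded in whisper_stream.py.
        return f"wss://api.openai.com/v1/realtime?model={self.stt_model}"


def load_model_config(env: Mapping[str, str] = os.environ) -> ModelConfig:
    """Read overrides from the environment, falling back to the audited defaults."""
    return ModelConfig(
        stt_model=env.get("KIRA_STT_MODEL", "gpt-realtime-whisper"),
        llm_model=env.get("KIRA_LLM_MODEL", "gpt-5.4-nano"),
        tts_voice=env.get("KIRA_TTS_VOICE", "sage"),
    )


if __name__ == "__main__":
    print(load_model_config().realtime_url)
```

With this in place, a model swap becomes an `.env` change instead of edits across `main.py` and `whisper_stream.py`.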

25. **No .env validation or example for Realtime keys**
    - .env.example exists but current stack needs only OPENAI + HONCHO (DeepSeek keys are vestigial).
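
Startup validation can be a few lines that fail fast with a readable message. The exact variable names below (`OPENAI_API_KEY`, `HONCHO_API_KEY`) are assumptions based on the "OPENAI + HONCHO" note above; adjust them to whatever `.env.example` actually defines.

```python
import os
from typing import List, Mapping, Sequence

# Assumed key names; the audit only says the stack needs OPENAI + HONCHO.
REQUIRED_ENV = ("OPENAI_API_KEY", "HONCHO_API_KEY")


def missing_env(env: Mapping[str, str], required: Sequence[str] = REQUIRED_ENV) -> List[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def validate_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise at startup instead of failing later inside a WebSocket handler."""
    missing = missing_env(env)
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))


if __name__ == "__main__":
    try:
        validate_env()
        print("environment ok")
    except RuntimeError as exc:
        print(exc)
```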

26. **Build produces large bundles**
    - Live2D cubism + pixi come to ~300 kB gzipped; no notes on tree-shaking.

27. **Local dev vs prod drift**
    - Vite proxy assumes `backend` hostname (docker).
    - No easy `npm run dev` + separate backend story documented.

---

## Quick Wins (Prioritized)

1. Fix Realtime model name + test the WS transcription session (or fall back to REST `gpt-4o-transcribe` + streaming deltas via another mechanism).
2. Make TTS playback incremental (use AudioContext buffer append or successive `<audio>` elements / MediaSource).
3. Wire `transcript_delta` into ChatBubble as a "Kira is hearing..." live line or user speech preview.
4. Delete or archive legacy stt/tts/llm.py and update all docs.
5. Render `<Notes />` in the grid (replace one column or add a 5th).
6. Add a simple white noise player (HTMLAudio with free CC0 tracks or oscillator).
7. Replace ScriptProcessor with AudioWorklet (or at least suppress the deprecation for now).
8. Fix stopwatch timer logic properly (separate elapsed state).
9. Clean unused imports + recorderRef.
10. Add basic user-facing error banner for mic/STT failures.

---

## Recommendations

- **Short term**: Revert voice pipeline to the previous working (cheaper) REST version (`gpt-4o-transcribe` + nano + streaming TTS) until the Realtime transcription model name + access is sorted. The "pivot to Realtime WS" introduced more problems than it solved for latency.
- **Medium**: Implement proper client VAD or use OpenAI's server VAD more cleanly; consider always-listening with push-to-interrupt.
- **Long**: Either get proper Realtime API access or switch to a different low-latency STT (e.g. browser Web Speech API for free tier, or Deepgram/AssemblyAI).
- Consider making the avatar always visible + voice always available rather than explicit "Talk" toggle for true body-double feel.

This audit is based on code + live server state as of the last successful deploy. Re-run after fixes.

Next steps: I can implement the quick wins (start with 1, 2, 3, 4) if you say the word. The old preserved todo list is now stale and should be replaced.
@@ -0,0 +1,157 @@
# Kira Body Double: Systematic Fix Plan (from AUDIT.md)

**Created:** 2026-06-04
**Source:** Full audit of codebase, running deployment (172.20.1.204), server logs, build outputs, component cross-references, and user-reported symptoms (33s+ delays, static, "Could not transcribe", console errors).
**Goal:** Turn the current non-functional voice pipeline + incomplete UI into a reliable, low-latency body-double experience matching the original specs (realtime voice, lo-fi + white noise, ADHD tools, avatar customization, notes, clock, pet, backgrounds).
**Approach:** Prioritize by dependency and impact. Fix core voice first (STT + audio playback + feedback). Then UI completeness. Then cleanup + polish. Deploy + verify after every change where possible. Re-test end-to-end on the live site at the end.
**Principles:**
- Prefer minimal, targeted patches over big refactors.
- Keep the cheap 3-step pipeline (or Realtime where it actually helps latency).
- Verify with real tool output (builds, docker logs, git, SSH health checks).
- Update docs and todo as we go.
- If a subtask blocks, note it and move to parallelizable items.

**Total items:** 9 (matching the current todo list). Dependencies noted.

---

## 1. fix-realtime-stt (CRITICAL: blocks everything voice-related)
**Status in audit:** Mic does nothing. WS connects then 4004s with `model_not_found`.
**Root cause:** Wrong base model in WS URL (`gpt-4o-mini-realtime-preview`). `gpt-realtime-whisper` exists per OpenAI announcements but must be used as the `?model=` in the Realtime WS connect for transcription mode. Session config may also need tweaking for pure STT (no LLM responses from the realtime model).
**Options:**
A. Correct model + config for Realtime WS (preferred for true streaming partials + VAD).
B. Revert to proven REST path (`gpt-4o-transcribe`) + simulate deltas if possible (cheaper, known working from history).
**Plan steps:**
1. Read current `backend/services/whisper_stream.py` and `backend/main.py` WS handling.
2. Update WS URL to `?model=gpt-realtime-whisper` (and adjust session.update if needed: remove or set transcription model correctly).
3. Add better error logging + client-facing error on WS failure.
4. Rebuild/deploy.
5. Verify: SSH docker logs for "Whisper stream ready" + no 4004/model_not_found. Simulate or note user test.
6. If still fails after 1-2 attempts: implement Option B (switch main.py back to REST transcription + keep the streaming delta plumbing for future).
**Risks:** OpenAI access tier for Realtime Whisper. Cost (Realtime is more expensive than REST transcription).
**Success criteria:** Backend accepts audio chunks and emits `transcript_delta` + `on_done` / `transcript` without errors. No more "Could not transcribe".
**Dependencies:** None (first).
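
For the success criteria above, the event handling can be kept testable by separating transcript accumulation from the WebSocket code. The sketch below assumes that delta events carry incremental text in a `delta` field and that the completed event carries the full text in `transcript`; the `TranscriptAccumulator` class is a hypothetical helper, not part of `whisper_stream.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Event names as used in the _handle change of this commit.
DELTA_EVENT = "conversation.item.input_audio_transcription.delta"
COMPLETED_EVENT = "conversation.item.input_audio_transcription.completed"


@dataclass
class TranscriptAccumulator:
    """Collects transcript deltas until a completed event arrives."""
    parts: List[str] = field(default_factory=list)

    def feed(self, event: dict) -> Optional[Tuple[str, str]]:
        """Return ("delta", text_so_far), ("done", final_text), or None."""
        kind = event.get("type", "")
        if kind == DELTA_EVENT:
            piece = event.get("delta", "")
            if not piece:
                return None
            # Assumption: deltas are incremental, so append rather than overwrite.
            self.parts.append(piece)
            return ("delta", "".join(self.parts))
        if kind == COMPLETED_EVENT:
            # Prefer the server's final transcript, fall back to what was collected.
            final = (event.get("transcript") or "".join(self.parts)).strip()
            self.parts.clear()
            return ("done", final) if final else None
        return None


if __name__ == "__main__":
    acc = TranscriptAccumulator()
    print(acc.feed({"type": DELTA_EVENT, "delta": "hel"}))
    print(acc.feed({"type": DELTA_EVENT, "delta": "lo"}))
    print(acc.feed({"type": COMPLETED_EVENT, "transcript": "hello "}))
```

A handler could map `("delta", text)` to a `transcript_delta` message and `("done", text)` to `on_done`, which keeps the WS plumbing free of string bookkeeping.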

## 2. streaming-audio-playback (HIGH: fixes perceived latency)
**Status:** Chunks stream from backend but frontend only plays on `speaking_end`.
**Plan steps:**
1. Read `frontend/src/hooks/useConversation.ts` audio handling + `backend/main.py` synthesize_speech.
2. Switch frontend to play chunks incrementally:
   - Option: Use a single Audio element backed by a MediaSource (attached via `src = URL.createObjectURL(mediaSource)`) and append Opus chunks to a SourceBuffer (complex but true streaming).
   - Simpler and more reliable: Create new `<audio>` elements or use Web Audio API to decode/play each small chunk as it arrives (low latency for first words).
   - Keep lip-sync tied to `isKiraSpeaking`.
3. Update backend to send smaller chunks or keep current streaming.
4. Handle `speaking_end` only for final cleanup / state.
5. Test build + deploy.
6. Verify with logs + manual timing (first audio chunk arrives ~1-2s after LLM).
**Success criteria:** User hears Kira's voice starting within ~1-2s of the user finishing speaking (first words audible while the rest is still generating).
**Dependencies:** 1 (need working STT to test TTS flow).

## 3. live-transcript-ui (MEDIUM: UX for body-double feel)
**Status:** Deltas arrive but are dropped.
**Plan steps:**
1. In `useConversation.ts`: expose live partial transcript state (append deltas, clear on `transcript` or `speaking_start`).
2. Pass the live partial to `ChatBubble.tsx` (or a new "Hearing..." bubble).
3. In ChatBubble: show a temporary "user is speaking: [live text]" or "Kira hears: ..." line with typing indicator while listening.
4. Clear it when final transcript is processed or on error.
5. Style girly-pop (pink bubble or inline).
6. Build/deploy/verify.
**Success criteria:** While holding Talk (or during server VAD), user sees real-time words appearing as she speaks.
**Dependencies:** 1 (deltas only come from working STT).

## 4. integrate-notes (LOW: quick completeness win)
**Status:** Component exists and is local-state only. Dead import in App.
**Plan steps:**
1. In `App.tsx`: import is already there; add `<Notes />` to the grid (e.g., replace or add alongside Timer or in a new row/column for balance).
2. Consider making notes persistent (localStorage) so they survive refresh (optional quick win).
3. Test layout on 1920x1080.
4. Build/deploy.
**Success criteria:** Notes panel visible and functional in the main dashboard.
**Dependencies:** None.

## 5. add-white-noise (MEDIUM: matches original spec)
**Status:** Spec: "white noise". Only YouTube lofi exists.
**Plan steps:**
1. Decide sources: free public-domain / CC0 ambient tracks (rain, cafe, fan, brown noise) hosted or use Web Audio oscillator + filters for simple generated noise (no external deps, reliable).
2. Extend or new component `WhiteNoisePlayer.tsx` (similar UX to MusicPlayer: buttons for presets, volume, play/pause).
3. Add to grid in `App.tsx` (perhaps replace or share space with MusicPlayer, or add a small ambient section).
4. Integrate with scenes if possible (e.g., rain scene enables rain sound).
5. Use `<audio>` elements with loop for tracks (host small mp3/wav in public/sounds or generate).
6. For zero-asset version: use AudioContext + noise buffer (pink/brown noise generator).
7. Build + deploy.
**Success criteria:** User can toggle white noise independently of (or layered with) lofi. Matches "white noise" in original request.
**Dependencies:** 4 (grid space).

## 6. clean-legacy-dead-code (MEDIUM: hygiene)
**Status:** Old services + docs describe pre-Honcho, pre-nano, pre-Sage, pre-streaming world.
**Plan steps:**
1. Delete (or move to `archive/`) `backend/services/stt.py`, `tts.py`, `llm.py`.
2. Update `backend/requirements.txt` if any unique deps were only for them.
3. Rewrite sections in:
   - `README.md` (features, quick start, pipeline, costs).
   - `ARCHITECTURE.md` (data flow, stack table, current Realtime attempt).
4. Update `PLAN.md` and `AUDIT.md` references if needed.
5. In `main.py`, remove the unused `time` import and the stale `pending_transcript` / `transcript_lock` variables.
6. Remove dead `recorderRef` in frontend hook.
7. Commit + deploy (frontend/backend rebuild not strictly needed but do it).
**Success criteria:** No more references to the old Whisper+DeepSeek+Nova pipeline. Codebase matches reality.
**Dependencies:** Can be done in parallel with others.

## 7. fix-deprecations (MEDIUM: polish + warnings)
**Subtasks (can batch):**
a. ScriptProcessorNode → AudioWorkletNode (or, at minimum, isolate the ScriptProcessorNode setup and log the fallback once; note that the browser's own deprecation warning cannot be caught with try/catch).
b. YouTube music: suppress or fix the postMessage/CORS errors, which are common YouTube IFrame issues. Options: host a silent proxy, use a different lofi source such as a free stream, or add `origin` handling + user gesture guards. For now, add try/catch around YT init and a note in the UI.
c. Timer stopwatch: replace the hacky formula + logic with proper elapsed counter (use a separate `elapsed` state that only increments when mode==='stopwatch').
d. Any other console warnings from build or previous user reports (Pixi interaction deprecation is already noted in history; check current build).
**Plan steps:**
1. Fix one subtask at a time in frontend (Timer.tsx, useConversation.ts, MusicPlayer.tsx).
2. Rebuild frontend after changes.
3. Verify no new errors in `npm run build`.
**Success criteria:** Clean console on load + mic use. Stopwatch actually counts up.
**Dependencies:** None (frontend-only).

## 8. fix-welcome-nesting (LOW)
**Status:** When saved ID exists, WelcomeScreen (full min-h-screen) is rendered inside a card div.
**Plan steps:**
1. In `App.tsx`: adjust the conditional rendering for WelcomeScreen: either make it always full-bleed (portal or separate route-like) or simplify the "identify" flow so it doesn't nest.
2. Test both new-user and returning-user paths.
3. Build/deploy.
**Success criteria:** Welcome screen looks correct (full or nicely carded) for both paths. No layout breakage.
**Dependencies:** None.

## 9. audit-followup (FINAL)
**Plan steps:**
1. After all above: full rebuild + deploy to 172.20.1.204.
2. SSH: `docker compose logs --tail 20`, health check, git status on server.
3. Update `AUDIT.md` with "Fixed" sections + remaining known issues.
4. Update this PLAN.md with completion notes.
5. Update todo list to all completed.
6. Provide user with summary + "hard refresh and test mic" instructions.
7. Optional: add a simple test script or note for future.
**Success criteria:** Live site has working (or clearly documented fallback) voice + all spec items visible. No critical items from original audit left open.
**Dependencies:** All previous.

---

## Execution Order & Parallelism
- **Phase 1 (Voice Core, do sequentially):** 1 → 2 → 3
- **Phase 2 (UI Completeness):** 4, then 5 (5 depends on 4 for grid space)
- **Phase 3 (Cleanup & Polish, parallelizable):** 6, 7, 8
- **Phase 4:** 9 (only after all)

**Estimated effort:** 1-2 hours of focused tool use + deploys (real time longer due to builds/SSH).

**Verification at each step:**
- `cd frontend && npm run build`
- `cd backend && python -c "import main; from services... import ..."`
- `git add -A && git commit -m "..." && git push`
- `ssh root@172.20.1.204 "cd /opt/kira && git pull && docker compose up -d --build 2>&1 | tail -5"`
- `ssh ... "docker logs kira-backend --tail 15 | grep -E 'whisper|STT|error|ready'"`
- Health: `curl http://localhost:8000/api/health` on server
- For frontend changes: note that nginx serves the new dist after rebuild.

**Rollback:** Every commit is atomic. Server has restart policy. Keep a "before" tag if needed.

**Post-plan:** After 9, offer to save this workflow as a skill if it proves reusable.

Start execution now. First action: tackle item 1 (fix-realtime-stt) by correcting the model name.
@@ -141,6 +141,10 @@ async def conversation_ws(websocket: WebSocket):

     async def on_error(msg: str):
         logger.warning(f"Whisper error: {msg}")
+        try:
+            await websocket.send_json({"type": "error", "message": f"Transcription error: {msg}"})
+        except Exception:
+            pass

     # Start WhisperStream
     stream = WhisperStream(

@@ -41,7 +41,7 @@ class WhisperStream:
         try:
             import websockets

-            url = "wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview"
+            url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"
             ws = await websockets.connect(
                 url,
                 additional_headers={
@@ -58,7 +58,6 @@ class WhisperStream:
             await self._send({
                 "type": "session.update",
                 "session": {
-                    "modalities": ["text"],  # no audio output
                     "input_audio_format": "pcm16",
                     "input_audio_transcription": {
                         "model": "gpt-realtime-whisper",
@@ -98,26 +97,41 @@ class WhisperStream:

        if et == "input_audio_buffer.speech_started":
            self._transcript = ""
            logger.debug("speech_started")

        elif et in ("conversation.item.input_audio_transcription.delta", "input_audio_buffer.transcription.delta"):
            # Partial streaming transcript
            delta = data.get("delta", "") or data.get("transcript", "")
            if delta:
                self._transcript = delta  # or append if cumulative
                await self._on_delta(delta)

        elif et in ("conversation.item.input_audio_transcription.completed", "input_audio_buffer.transcription.completed", "conversation.item.created"):
            # Final or item created with transcript
            item = data.get("item", {})
            content = item.get("content", []) if item else []
            transcript = data.get("transcript", "")
            if not transcript:
                for part in (content or []):
                    if part.get("type") in ("transcript", "text"):
                        transcript = part.get("transcript", "") or part.get("text", "")
                        break
            if transcript:
                self._transcript = transcript
                await self._on_delta(transcript)
                await self._on_done(transcript.strip())
                self._transcript = ""

        elif et == "input_audio_buffer.speech_stopped":
            if self._transcript.strip():
                await self._on_done(self._transcript.strip())
                self._transcript = ""

        elif et == "conversation.item.created":
            item = data.get("item", {})
            content = item.get("content", [])
            for part in (content or []):
                pt = part.get("type", "")
                txt = part.get("transcript", "") or part.get("text", "")
                if pt == "transcript" and txt:
                    self._transcript = txt
                    await self._on_delta(txt)

        elif et == "error":
            err = data.get("error", {})
            msg = err.get("message", str(data))
            logger.warning(f"Whisper error: {msg}")
            await self._on_error(msg)

    async def send_audio(self, pcm16_bytes: bytes):
        if not self._connected: