hobokenchicken/kira

Author	SHA1	Message	Date
hobokenchicken	187c33637a	fix(honcho): wrap all Honcho calls in try/except (401 key issue)	2026-06-05 23:46:35 -04:00
hobokenchicken	2da52fbc3c	fix(gemini): AUDIO only modality (model doesn't support AUDIO+TEXT)	2026-06-05 23:45:33 -04:00
hobokenchicken	378cc153de	fix(gemini): correct WebSocket URL path (GenerativeService.BidiGenerateContent)	2026-06-05 23:40:50 -04:00
hobokenchicken	83a990e838	feat(audio): Gemini Live API replaces Whisper+GPT+ElevenLabs. Single WebSocket proxy: frontend PCM16 16kHz → backend → Gemini Live API. Gemini returns PCM16 24kHz audio + text. Playback via Web Audio API queue. Removed OpenAI/DeepSeek deps. Model: gemini-3.1-flash-live-preview. Voice: Aoede. Streaming bidirectional audio with silence gating.	2026-06-05 23:36:29 -04:00
hobokenchicken	92250a668b	fix: restore full WebSocket message loop in main.py (was truncated to 77 lines, missing the entire try/except message handler). The REST STT revert commit (`0e74a16`) deleted lines 78-262 including the message loop, identify handler, audio handler, text handler, and disconnect cleanup. This caused the WS to accept then immediately close, triggering a reconnect loop. Refactored for clarity: transcribe_audio(), get_kira_response(), stream_tts() as standalone async helpers. Full pipeline restored.	2026-06-05 02:10:41 -04:00
hobokenchicken	1bfc8333e9	clean: archive legacy stt/tts/llm services; update ARCHITECTURE.md + README.md to current stack (REST gpt-4o-transcribe, nano, sage, Honcho, incremental TTS, white noise). Per PLAN item 6. Legacy files moved to archive/legacy-pipeline/.	2026-06-04 15:54:12 -04:00
hobokenchicken	0e74a16b40	fix(stt): revert to reliable REST gpt-4o-transcribe + MediaRecorder full-blob (Realtime WS not accessible on key) - Backend: added transcribe_audio (gpt-4o-transcribe), switched audio handler to full blob -> REST -> LLM -> streaming TTS - Frontend: MediaRecorder (webm/opus) full recording sent on stop (one blob per utterance) - Removed dead WhisperStream callbacks and pending_transcript/lock - This unblocks voice per AUDIT item 1 (Option B fallback). Deltas will come in later item. - Also preps for deprecation fix (MediaRecorder is the good path).	2026-06-04 15:23:57 -04:00
hobokenchicken	188da1d52a	fix(stt): try gpt-4o-realtime-preview as base session model + gpt-realtime-whisper for input_audio_transcription (per OpenAI error guidance)	2026-06-04 15:16:08 -04:00
hobokenchicken	191b7ad9b5	fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed - URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview) - Cleaned session.update (removed modalities that may not apply) - Expanded _handle to catch input_audio_transcription.delta and .completed events - on_error now forwards transcription errors to frontend client - Per AUDIT + PLAN item 1	2026-06-04 15:14:26 -04:00
hobokenchicken	7502f201c7	feat: Realtime WebSocket STT via gpt-realtime-whisper. Replaces REST-based transcription (gpt-4o-transcribe) with WebSocket streaming via gpt-realtime-whisper. Frontend captures PCM16 audio and streams it through the backend to a Realtime transcription session. - Server-side VAD detects utterance boundaries automatically - Word-level transcript deltas stream to the client in real-time - On utterance end, gpt-5.4-nano generates a response - TTS streams back via with_streaming_response - Total pipeline: PCM16 → Realtime WS → LLM → streaming TTS	2026-06-04 14:26:19 -04:00
hobokenchicken	25b12ee14f	fix: gpt-realtime-whisper requires Realtime API, not REST endpoint. Swapped to gpt-4o-transcribe ($0.006/min), a middle ground between speed and cost.	2026-06-04 14:23:41 -04:00
hobokenchicken	3128f69e48	fix: switch TTS voice from nova to sage	2026-06-04 14:22:18 -04:00
hobokenchicken	f98f87b7ee	fix: swap to gpt-realtime-whisper for STT. Replaces gpt-4o-mini-transcribe ($0.003/min) with gpt-realtime-whisper ($0.017/min). Expected to reduce transcription latency from ~2.6s to ~1s due to the model's realtime optimization.	2026-06-04 14:20:40 -04:00
hobokenchicken	9cd183a83b	fix: streaming TTS via with_streaming_response. Replaced synchronous TTS (waiting for full audio at 5.9s) with streaming TTS that sends audio chunks as they arrive. Backend now accumulates chunks in audioBufferRef and plays the complete stream on speaking_end. Reduces TTS latency from ~6s to ~1s first byte.	2026-06-04 14:17:54 -04:00
hobokenchicken	2cd5636ad6	debug: add per-step timing logs to identify latency bottleneck	2026-06-04 14:14:02 -04:00
hobokenchicken	7875b5d12a	fix: cache Honcho memory context per-session (not per-turn). The memory context was being rebuilt on every conversation turn via build_system_prompt(), which calls Honcho's dialectic reasoning API twice (get_user_context + get_kira_context). Each call takes 5-15s. Now the memory suffix is computed ONCE during identify and cached in a memory_suffix variable for the session duration. Per-turn latency drops from ~37s to ~3s. Also removed duplicated _pcm16_to_wav and cleaned up orphaned code.	2026-06-04 14:11:14 -04:00
hobokenchicken	c5cc4dd480	fix: replace PCM16 capture with MediaRecorder (Opus/webm). PCM16 capture via AudioContext was streaming raw audio continuously, causing massive accumulated buffers that took ~20s to transcribe. Replaced with MediaRecorder, which records compressed Opus/webm and sends a single blob on release: much smaller, faster to transcribe. Also removed all unused PCM16/WAV helper functions from both frontend and backend.	2026-06-04 14:04:44 -04:00
hobokenchicken	a370f1ebff	fix: use max_completion_tokens for gpt-5.4-nano. gpt-5.4-nano uses max_completion_tokens instead of max_tokens.	2026-06-04 13:57:05 -04:00
hobokenchicken	a19ac46312	fix: wrap PCM16 in WAV container before STT API call. Frontend captures PCM16 mono 24kHz audio. The transcription API expects a proper audio container format (wav, webm, etc.), not raw PCM16 data. Added _pcm16_to_wav() to wrap the raw bytes in a WAV header before sending to gpt-4o-mini-transcribe.	2026-06-04 13:55:05 -04:00
hobokenchicken	f2a5416408	feat: cheapest pipeline (gpt-4o-mini-transcribe + gpt-5.4-nano + TTS). Simple 3-step chat completions pipeline at ~$0.019/min total. Streams PCM16 audio from frontend, transcribes on release, generates response via gpt-5.4-nano, speaks via OpenAI TTS. Cost breakdown: gpt-4o-mini-transcribe: $0.003/min; gpt-5.4-nano: ~$0.001/min; OpenAI TTS (nova): $0.015/min. Total: ~$0.019/min (~$0.57/day at 30min).	2026-06-04 13:51:35 -04:00
hobokenchicken	66e799a655	fix: connect directly via websockets to bypass OpenAI-Beta header. The openai library's beta.realtime.connect() hardcodes the obsolete 'OpenAI-Beta: realtime=v1' header which the GA API rejects. Connecting directly via the websockets library with only the Authorization header resolves the 'beta_api_shape_disabled' error.	2026-06-04 13:49:56 -04:00
hobokenchicken	274d04ea10	feat: hybrid pipeline (gpt-realtime-whisper + gpt-5.4-nano + TTS). Hybrid approach gives streaming STT at ~$0.017/min + cheap brain at ~$0.001/min + TTS at ~$0.015/min = ~$0.033/min total. - gpt-realtime-whisper handles streaming transcription with VAD - gpt-5.4-nano handles response generation (chat completions) - OpenAI TTS (nova) for voice output - Server VAD detects utterance boundaries - Honcho memory context injected into system prompt - Removed old full Realtime relay service	2026-06-04 13:48:06 -04:00
hobokenchicken	1c15d42e06	fix: remove OpenAI-Beta header, use gpt-realtime-2 GA model. The old OpenAI-Beta: realtime=v1 header is rejected by the GA API. Removing it via extra_headers override. Using gpt-realtime-2, which is the current production Realtime model.	2026-06-04 13:42:06 -04:00
hobokenchicken	e2332af8d0	feat: OpenAI Realtime API pipeline. Replaced the 3-step sequential pipeline (Whisper STT → DeepSeek LLM → OpenAI TTS) with a single OpenAI Realtime API WebSocket using gpt-4o-mini-realtime-preview. - ~300-800ms latency vs 1-3s - Server VAD for automatic turn detection - Streaming audio chunks during playback - Interruptions: user can speak over Kira mid-response - Honcho memory still injected into session instructions - Frontend captures PCM16 mono 24kHz via AudioContext - Backend relays client ↔ OpenAI Realtime API - Supports both voice (PCM16) and text input	2026-06-04 13:32:39 -04:00
hobokenchicken	78ea059f08	feat: user personalization with Honcho-backed preferences - WelcomeScreen: first-time name entry with cute onboarding - identify WS message: sets user_id, loads saved prefs from Honcho - set_preference WS message: saves scene/outfit/accessory to Honcho metadata - Preferences auto-load on return visits via localStorage + Honcho peer meta - Kira uses the user's name in greeting and prompts - Backend: get/set preference methods in KiraMemory service - Frontend: optimistic preference updates, synced to backend on change	2026-06-04 11:00:58 -04:00
hobokenchicken	97424cb98f	init: Kira, AI body double with Honcho memory. Full voice pipeline (Whisper STT -> DeepSeek LLM -> OpenAI TTS), animated SVG avatar (Live2D-ready), girly-pop UI, lofi music, timer/notes/pets/wardrobe widgets, 10 background scenes with particle effects, Honcho cross-session memory.	2026-06-04 10:51:38 -04:00

26 Commits
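
Commit a19ac46312 describes wrapping raw PCM16 bytes in a WAV header before calling the transcription API. The repository's actual `_pcm16_to_wav()` is not shown in this log, so the following is only a minimal sketch of that idea using Python's standard `wave` module. The function name `pcm16_to_wav` and its parameters are assumptions; the mono 24 kHz defaults come from the commit message.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw little-endian PCM16 samples in a WAV container (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)    # mono, per the commit message
        wav.setsampwidth(2)           # PCM16 = 2 bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)          # header sizes are patched when the writer closes
    return buf.getvalue()


if __name__ == "__main__":
    # 0.1 s of silence at 24 kHz mono: 2400 samples, 2 bytes each.
    wav_bytes = pcm16_to_wav(b"\x00\x00" * 2400)
    with wave.open(io.BytesIO(wav_bytes), "rb") as check:
        print(check.getframerate(), check.getnchannels(), check.getnframes())
```

The resulting bytes can be passed to any API that expects a `.wav` upload. Later commits (c5cc4dd480, 7875b5d12a) removed this helper in favor of MediaRecorder blobs, so it applies only to the PCM16 capture path.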