Commit Graph

84 Commits

Author SHA1 Message Date
hobokenchicken 1bfc8333e9 clean: archive legacy stt/tts/llm services; update ARCHITECTURE.md + README.md to current stack (REST gpt-4o-transcribe, nano, sage, Honcho, incremental TTS, white noise)
Per PLAN item 6. Legacy files moved to archive/legacy-pipeline/.
2026-06-04 15:54:12 -04:00
hobokenchicken 59b72aa184 feat(white-noise): add Web Audio generated white/pink/brown/rain/cafe noise player
Separate from lofi music per original spec. Toggleable, volume control, always available in focus column.
Finishes item 5.
2026-06-04 15:49:18 -04:00
hobokenchicken 3f1497174d feat(ui): integrate Notes component into main grid (was dead import)
Per original spec and AUDIT/PLAN item 4. Notes now visible alongside Timer/Music.
2026-06-04 15:38:52 -04:00
hobokenchicken 771c00830a feat(ui): display livePartial / transcript_delta in ChatBubble as 'Hearing:' indicator
- REST STT now sends delta (full text) so UI lights up immediately with what was heard.
- Works as 'live' for the final transcript (true partials would stream words if Realtime was available).
- Per PLAN item 3.
2026-06-04 15:34:37 -04:00
hobokenchicken 77cbd91b93 fix(tts): play Opus chunks immediately as they arrive instead of buffering until speaking_end
This makes the voice start playing the first words while the rest of the response is still generating (big win for perceived latency).
Per PLAN item 2.
2026-06-04 15:28:40 -04:00
hobokenchicken 0e74a16b40 fix(stt): revert to reliable REST gpt-4o-transcribe + MediaRecorder full-blob (Realtime WS not accessible on key)
- Backend: added transcribe_audio (gpt-4o-transcribe), switched audio handler to full blob -> REST -> LLM -> streaming TTS
- Frontend: MediaRecorder (webm/opus) full recording sent on stop (one blob per utterance)
- Removed dead WhisperStream callbacks and pending_transcript/lock
- This unblocks voice per AUDIT item 1 (Option B fallback). Deltas will come in a later item.
- Also preps for deprecation fix (MediaRecorder is the good path).
2026-06-04 15:23:57 -04:00
hobokenchicken 188da1d52a fix(stt): try gpt-4o-realtime-preview as base session model + gpt-realtime-whisper for input_audio_transcription (per OpenAI error guidance) 2026-06-04 15:16:08 -04:00
hobokenchicken 191b7ad9b5 fix(stt): correct Realtime WS model to gpt-realtime-whisper + enhance event handling for deltas/completed
- URL now uses ?model=gpt-realtime-whisper (was invalid gpt-4o-mini-realtime-preview)
- Cleaned session.update (removed modalities that may not apply)
- Expanded _handle to catch input_audio_transcription.delta and .completed events
- on_error now forwards transcription errors to frontend client
- Per AUDIT + PLAN item 1
2026-06-04 15:14:26 -04:00
hobokenchicken 7502f201c7 feat: Realtime WebSocket STT via gpt-realtime-whisper
Replaces REST-based transcription (gpt-4o-transcribe) with WebSocket
streaming via gpt-realtime-whisper. Frontend captures PCM16 audio and
streams it through the backend to a Realtime transcription session.

- Server-side VAD detects utterance boundaries automatically
- Word-level transcript deltas stream to the client in real-time
- On utterance end, gpt-5.4-nano generates a response
- TTS streams back via with_streaming_response
- Total pipeline: PCM16 → Realtime WS → LLM → streaming TTS
2026-06-04 14:26:19 -04:00
hobokenchicken 25b12ee14f fix: gpt-realtime-whisper requires Realtime API, not REST endpoint
Swapped to gpt-4o-transcribe ($0.006/min), a middle ground between
speed and cost.
2026-06-04 14:23:41 -04:00
hobokenchicken 3128f69e48 fix: switch TTS voice from nova to sage 2026-06-04 14:22:18 -04:00
hobokenchicken f98f87b7ee fix: swap to gpt-realtime-whisper for STT
Replaces gpt-4o-mini-transcribe ($0.003/min) with gpt-realtime-whisper
($0.017/min). Expected to reduce transcription latency from ~2.6s
to ~1s due to the model's realtime optimization.
2026-06-04 14:20:40 -04:00
hobokenchicken 9cd183a83b fix: streaming TTS via with_streaming_response
Replaced synchronous TTS (waiting for full audio at 5.9s) with
streaming TTS that sends audio chunks as they arrive. Frontend now
accumulates chunks in audioBufferRef and plays the complete stream
on speaking_end. Reduces TTS latency from ~6s to ~1s first byte.
2026-06-04 14:17:54 -04:00
hobokenchicken 2cd5636ad6 debug: add per-step timing logs to identify latency bottleneck 2026-06-04 14:14:02 -04:00
hobokenchicken 7875b5d12a fix: cache Honcho memory context per-session (not per-turn)
The memory context was being rebuilt on every conversation turn via
build_system_prompt(), which calls Honcho's dialectic reasoning API
twice (get_user_context + get_kira_context). Each call takes 5-15s.

Now the memory suffix is computed ONCE during identify and cached in
a memory_suffix variable for the session duration. Per-turn latency
drops from ~37s to ~3s.

Also removed duplicated _pcm16_to_wav and cleaned up orphaned code.
2026-06-04 14:11:14 -04:00
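A minimal sketch of the per-session caching described in the commit above, not the repository's code. The `Session` class, the `fetch_memory` callable (a stand-in for the slow Honcho context calls), and `fake_fetch` are hypothetical; only the names `memory_suffix`, `identify`, and `build_system_prompt` come from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Session:
    user_id: str
    # Hypothetical stand-in for the slow memory lookups (5-15s each in the commit).
    fetch_memory: Callable[[str], str]
    memory_suffix: Optional[str] = None

    def identify(self) -> None:
        # Compute the memory context once, when the user is identified.
        self.memory_suffix = self.fetch_memory(self.user_id)

    def build_system_prompt(self, base: str) -> str:
        # Reuse the cached suffix on every turn instead of refetching it.
        if self.memory_suffix is None:
            self.identify()
        return f"{base}\n\n{self.memory_suffix}"


calls = []


def fake_fetch(user_id: str) -> str:
    # Records each invocation so the caching behaviour can be observed.
    calls.append(user_id)
    return f"Known facts about {user_id}"


session = Session("alice", fake_fetch)
session.identify()
prompts = [session.build_system_prompt("You are Kira.") for _ in range(3)]
```

After three turns, `calls` holds a single entry: the expensive lookup ran once at identify time and each turn only formats a string.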
hobokenchicken c5cc4dd480 fix: replace PCM16 capture with MediaRecorder (Opus/webm)
PCM16 capture via AudioContext was streaming raw audio continuously,
causing massive accumulated buffers that took ~20s to transcribe.
Replaced with MediaRecorder which records compressed Opus/webm and
sends a single blob on release — much smaller, faster to transcribe.

Also removed all unused PCM16/WAV helper functions from both frontend
and backend.
2026-06-04 14:04:44 -04:00
hobokenchicken 537ddcd841 fix: play Opus TTS audio directly instead of WAV-converting it
The backend sends Opus-encoded audio from OpenAI TTS (tts-1 with
response_format=opus). The frontend was treating it as raw PCM16
and wrapping it in a WAV container, which corrupted the audio into
static. Now plays the Opus data directly as audio/ogg.
2026-06-04 13:59:04 -04:00
hobokenchicken a370f1ebff fix: use max_completion_tokens for gpt-5.4-nano
gpt-5.4-nano uses max_completion_tokens instead of max_tokens
2026-06-04 13:57:05 -04:00
hobokenchicken a19ac46312 fix: wrap PCM16 in WAV container before STT API call
Frontend captures PCM16 mono 24kHz audio. The transcription API
expects a proper audio container format (wav, webm, etc.), not raw
PCM16 data. Added _pcm16_to_wav() to wrap the raw bytes in a WAV
header before sending to gpt-4o-mini-transcribe.
2026-06-04 13:55:05 -04:00
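A sketch of what a helper like `_pcm16_to_wav()` might look like, assuming little-endian PCM16 mono at 24 kHz as stated in the commit above. It uses only Python's standard `wave` module; the function name `pcm16_to_wav` and its body are illustrative and not taken from the repository.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw PCM16 samples in a WAV (RIFF) container, in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Half a second of silence: 12000 samples at 24 kHz, 2 bytes each.
silence = b"\x00\x00" * 12000
wav_bytes = pcm16_to_wav(silence)
```

The result is the original sample data preceded by a 44-byte header, which is the kind of container format the transcription endpoint expects instead of headerless PCM.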
hobokenchicken f2a5416408 feat: cheapest pipeline — gpt-4o-mini-transcribe + gpt-5.4-nano + TTS
Simple 3-step chat completions pipeline at ~$0.019/min total.
Streams PCM16 audio from frontend, transcribes on release,
generates response via gpt-5.4-nano, speaks via OpenAI TTS.

Cost breakdown:
  gpt-4o-mini-transcribe: $0.003/min
  gpt-5.4-nano:          ~$0.001/min
  OpenAI TTS (nova):      $0.015/min
  Total:                 ~$0.019/min (~$0.57/day at 30min)
2026-06-04 13:51:35 -04:00
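The cost breakdown in the commit above can be checked with a short calculation. This sketch assumes the per-minute figures are US dollars (0.003, 0.001, and 0.015) and that usage is 30 minutes a day; the dictionary and function names are illustrative only.

```python
from decimal import Decimal

# Per-minute rates as read from the commit message (assumed to be USD).
RATES_PER_MIN = {
    "gpt-4o-mini-transcribe": Decimal("0.003"),
    "gpt-5.4-nano": Decimal("0.001"),
    "OpenAI TTS (nova)": Decimal("0.015"),
}


def total_per_minute(rates: dict) -> Decimal:
    # Sum the three pipeline stages; Decimal avoids float rounding noise.
    return sum(rates.values(), Decimal("0"))


def daily_cost(rates: dict, minutes_per_day: int) -> Decimal:
    return total_per_minute(rates) * minutes_per_day


total = total_per_minute(RATES_PER_MIN)
daily = daily_cost(RATES_PER_MIN, 30)
```

Under these assumptions `total` is 0.019 per minute and `daily` is 0.57, matching the totals quoted in the commit.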
hobokenchicken 66e799a655 fix: connect directly via websockets to bypass OpenAI-Beta header
The openai library's beta.realtime.connect() hardcodes the obsolete
'OpenAI-Beta: realtime=v1' header which the GA API rejects. Connecting
directly via the websockets library with only the Authorization header
resolves the 'beta_api_shape_disabled' error.
2026-06-04 13:49:56 -04:00
hobokenchicken 274d04ea10 feat: hybrid pipeline — gpt-realtime-whisper + gpt-5.4-nano + TTS
Hybrid approach gives streaming STT at ~$0.017/min + cheap brain
at ~$0.001/min + TTS at ~$0.015/min = ~$0.033/min total.

- gpt-realtime-whisper handles streaming transcription with VAD
- gpt-5.4-nano handles response generation (chat completions)
- OpenAI TTS (nova) for voice output
- Server VAD detects utterance boundaries
- Honcho memory context injected into system prompt
- Removed old full Realtime relay service
2026-06-04 13:48:06 -04:00
hobokenchicken 1c15d42e06 fix: remove OpenAI-Beta header, use gpt-realtime-2 GA model
The old OpenAI-Beta: realtime=v1 header is rejected by the GA API.
Removing it via extra_headers override. Using gpt-realtime-2 which
is the current production Realtime model.
2026-06-04 13:42:06 -04:00
hobokenchicken e2332af8d0 feat: OpenAI Realtime API pipeline
Replaced the 3-step sequential pipeline (Whisper STT → DeepSeek LLM
→ OpenAI TTS) with a single OpenAI Realtime API WebSocket using
gpt-4o-mini-realtime-preview.

- ~300-800ms latency vs 1-3s
- Server VAD for automatic turn detection
- Streaming audio chunks during playback
- Interruptions: user can speak over Kira mid-response
- Honcho memory still injected into session instructions
- Frontend captures PCM16 mono 24kHz via AudioContext
- Backend relays client ↔ OpenAI Realtime API
- Supports both voice (PCM16) and text input
2026-06-04 13:32:39 -04:00
hobokenchicken e64698b0ab fix: graceful mic-unavailable handling over HTTP
navigator.mediaDevices.getUserMedia() requires a secure context
(HTTPS or localhost). When accessed over plain HTTP, the API
is undefined. Now shows a friendly chat message instead of a
cryptic TypeError in the console.
2026-06-04 12:12:07 -04:00
hobokenchicken 895fb9ac0b fix: Live2D Ticker registration + outfit texture swap path
- Registered pixi Ticker via (Live2DModel as any).registerTicker()
  to fix 'No Ticker registered' warning and animation issues
- Fixed outfit texture swap: textures live on model.textures[]
  not model.internalModel.textures[]
2026-06-04 12:10:20 -04:00
hobokenchicken 3d3df64d7c fix: missing mic toggle in Live2D view + YouTube autoplay
KiraAvatar: Added Talk mic button to Live2D view (was only in
AnimatedAvatar fallback). Includes listening-pulse animation.

MusicPlayer: Replaced hidden YouTube iframe with proper IFrame
Player API. Now starts on explicit user click (Start Lo-Fi button),
complying with browser autoplay policies. Supports station
switching and volume control after playback starts.
2026-06-04 12:06:16 -04:00
hobokenchicken bee428ae0c fix: outfit texture swap via internalModel.textures array
model.internalModel.coreModel.setTexture() expects a raw WebGL
texture, not a PixiJS Texture. Instead, set the new PixiJS Texture
directly on the model's internalModel.textures[2] array. The render
loop's bindTexture() call extracts the WebGL handle from the PixiJS
BaseTexture and passes it to the Cubism core.

This eliminates the cascade of try-catch fallbacks and the
'coreModel.setTexture is not a function' TypeError.
2026-06-04 12:02:48 -04:00
hobokenchicken 0a6946b580 fix: pixi v7 isInteractive TypeError + outfit texture swap
- Added isInteractive() stub on Live2DModel to prevent
  'e.isInteractive is not a function' errors in pixi v7
- Swapped outfit texture loading to use Assets.load() with
  cascading fallbacks (model.setTexture -> internalModel -> coreModel)
- Removed unused Texture import
2026-06-04 11:57:33 -04:00
hobokenchicken d519258942 fix: TypeScript build errors in Docker
- useRef(null) initial values for Strict TS 6.0
- Removed deprecated transparent option in pixi v7
- Cast Live2DModel to any for addChild type compat
- Cast coreModel for getTexture access
- Fixed types/index.ts import path to scenes
- Added vite-env.d.ts with CSS module declarations
- Added null-coalesce for clearInterval calls
2026-06-04 11:46:58 -04:00
hobokenchicken b7edf6a82d feat: Live2D outfit textures + expression system + canvas tweaks
- Generated 5 outfit texture variants via HSL recolor (saved skin tones)
- Dynamic texture_02 swapping when outfit changes
- Expression buttons (Normal, Smile, Sad, Angry, Surprised, Blushing)
- Random idle expression changes every 8-15s
- Responsive canvas sizing with devicePixelRatio support
- Outfit generation script in scripts/gen_outfits.py
- Smoother lip-sync with phase-based mouth animation
2026-06-04 11:40:10 -04:00
hobokenchicken 9653f80abd feat: Live2D model integration with pixi-live2d-display
- Added Epsilon Live2D model (Cubism 4) with full motion/expression set
- KiraAvatar now loads Live2D via PixiJS + cubism4 renderer
- Idle animation auto-plays on load
- Lip-sync: PARAM_MOUTH_OPEN_Y driven by speaking state
- 8 expressions (Normal, Smile, Sad, Angry, Surprised, Blushing, f01, f02)
- 15 motion files including idle, tap, flick, shake
- Physics, eye blink, and LipSync parameter groups configured
- Falls back to animated SVG placeholder if model isn't available
2026-06-04 11:34:59 -04:00
hobokenchicken 78ea059f08 feat: user personalization with Honcho-backed preferences
- WelcomeScreen: first-time name entry with cute onboarding
- identify WS message: sets user_id, loads saved prefs from Honcho
- set_preference WS message: saves scene/outfit/accessory to Honcho metadata
- Preferences auto-load on return visits via localStorage + Honcho peer meta
- Kira uses the user's name in greeting and prompts
- Backend: get/set preference methods in KiraMemory service
- Frontend: optimistic preference updates, synced to backend on change
2026-06-04 11:00:58 -04:00
hobokenchicken 97424cb98f init: Kira — AI body double with Honcho memory
Full voice pipeline (Whisper STT -> DeepSeek LLM -> OpenAI TTS),
animated SVG avatar (Live2D-ready), girly-pop UI, lofi music,
timer/notes/pets/wardrobe widgets, 10 background scenes with
particle effects, Honcho cross-session memory.
2026-06-04 10:51:38 -04:00