Files

T

hobokenchicken 1bfc8333e9 clean: archive legacy stt/tts/llm services; update ARCHITECTURE.md + README.md to current stack (REST gpt-4o-transcribe, nano, sage, Honcho, incremental TTS, white noise)

Per PLAN item 6. Legacy files moved to archive/legacy-pipeline/.

2026-06-04 15:54:12 -04:00

8.1 KiB

Raw Blame History

Kira — Architecture Plan

Overview

Browser-based real-time AI body double. She talks to Kira (microphone → STT → LLM → TTS → speaker), Kira talks back with lip-sync. Lo-fi from Lofi Girl streaming in the background, two cats hanging out, customizable background scenes, timers, notes, and a full wardrobe/accessory system.

Tech Stack

Layer	Choice	Why
Frontend	React 18 + Vite + TypeScript	Fast dev loop, runs in any browser
Styling	TailwindCSS + custom girly-pop theme	Pink/lavender/mint palette, rounded everything
Live2D	Cubism 4 Web SDK via pixi-live2d-display	Full lip-sync, gestures, blink, idle anims
Backend	Python FastAPI + WebSockets	Async-native, easy AI API integration
STT	OpenAI Whisper API	Best transcription, <$0.01/min
LLM	DeepSeek V4 (cloud API)	Smart, fast reasoning, good personality adherence
TTS	OpenAI TTS API	Clean female voice, low latency
Music	YouTube IFrame Player API (Lofi Girl channel)	Free, endless streams, no backend proxy
Audio Pipeline	MediaRecorder → chunks → WebSocket → backend	Low-latency real-time conversation
Infrastructure	Docker Compose + Caddy reverse proxy	Deployable in homelab immediately

Data Flow (Conversation)

┌─────────────────────────────────────────────────────────┐
│  Browser (React + Live2D + girly UI)                     │
│                                                          │
│  [Mic] → MediaRecorder (webm/opus full utterance)       │
│       ↓ (WebSocket /api/ws)                              │
│  [FastAPI Backend]                                       │
│       ↓                                                 │
│  1. REST gpt-4o-transcribe → full text (with delta emit)|
│  2. gpt-5.4-nano + Honcho memory suffix                 │
│  3. OpenAI TTS (sage) streaming response → Opus chunks  │
│       ↑ (WebSocket)                                      │
│  [Incremental Audio playback + Live2D + "Hearing" UI]   │
│                                                          │
│  White noise (Web Audio) and lofi run independently.     │
│  VAD is client-button based (toggle Talk).               │
└─────────────────────────────────────────────────────────┘

Current: REST STT for reliability (Realtime WS attempted but blocked by model access).

Directory Structure

ai-body-double/
├── docker-compose.yml        # Single compose for both services
├── .env.example              # API keys template
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── main.py               # FastAPI app + WebSocket handler
│   ├── config.py             # Environment/API config
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── conversation.py   # WS: mic → STT → LLM → TTS roundtrip
│   │   ├── tools.py          # REST: timers, notes, backgrounds
│   │   └── assets.py         # REST: outfit/pet state, backgrounds
│   ├── services/
│   │   ├── __init__.py
│   │   ├── memory.py         # Honcho wrapper (context, prefs, summaries)
│   │   └── whisper_stream.py # Realtime WS attempt (archived use; fallback to REST)
│   └── models/
│       ├── __init__.py
│       └── schemas.py        # Pydantic models
├── frontend/
│   ├── Dockerfile (nginx)
│   ├── package.json
│   ├── vite.config.ts
│   ├── tailwind.config.js
│   ├── index.html
│   ├── public/
│   │   └── live2d/           # Live2D model files
│   │       ├── kira.model3.json
│   │       ├── kira.moc3
│   │       └── textures/
│   └── src/
│       ├── main.tsx
│       ├── App.tsx
│       ├── api/
│       │   └── ws.ts         # WebSocket client
│       ├── components/
│       │   ├── KiraAvatar.tsx     # Live2D canvas wrapper
│       │   ├── BackgroundScene.tsx # Scene selector + overlay
│       │   ├── MusicPlayer.tsx    # Lofi Girl YouTube embed
│       │   ├── Timer.tsx          # Pomodoro + countdown
│       │   ├── Notes.tsx          # Quick notes widget
│       │   ├── Clock.tsx          # Digital clock
│       │   ├── PetZone.tsx        # Two cats 🐱🐱
│       │   ├── Toolbar.tsx        # Bottom nav
│       │   └── Wardrobe.tsx       # Outfit/accessory picker
│       ├── hooks/
│       │   ├── useAudio.ts        # Mic recording + playback
│       │   └── useKiraState.ts    # Shared state
│       ├── styles/
│       │   └── theme.css          # Girly-pop palette
│       └── types/
│           └── index.ts

Phase 1 Build Order

I'll build this in dependency order so each piece is testable:

Phase 1a — Skeleton (this session)

Project scaffold (frontend + backend + docker-compose)
Basic UI layout with girly-pop styling
Clock widget
Background scene selector (CSS gradient scenes)
Music player (Lofi Girl YouTube embed)
Timer widget (Pomodoro + countdown)

Phase 1b — Audio Pipeline

Backend: FastAPI + WebSocket handler
Backend: STT service (Whisper API)
Backend: LLM service (DeepSeek)
Backend: TTS service (OpenAI)
Frontend: Microphone recording → WebSocket
Frontend: Audio playback + conversation UI

Phase 1c — Kira the Avatar

Integrate Live2D SDK with placeholder model
Idle animations (blink, breath, random gestures)
Lip-sync from TTS audio
Wardrobe/outfit system

Phase 1d — Cats & Polish

Pet zone with two cats (CSS-animated sprites)
Notes widget
Color polish, transitions, responsive layout
Docker compose finalization + Caddy config

Key Design Decisions

Live2D Model

We'll use a free sample Live2D model (Hiyori or Mao from Cubism SDK samples) as a placeholder
Custom "Kira" model can be commissioned and swapped in later — the SDK integration is identical
Lip-sync driven by analyzing TTS audio amplitude (simplified: no phoneme mapping needed)

Background Scenes

CSS gradient + pattern scenes for Phase 1 (no image assets needed)
Scene types: Cozy Room (warm), Coffee Shop (browns), Garden (greens), Rainy Window (blues), Starry Night (dark purples), Sakura Spring (pinks)
Scene = animated CSS background + subtle particle overlay (rain, stars, petals)

The Two Cats

CSS/Canvas-animated sprites (not Live2D — overkill for cats)
Orange fluffy: larger, follows cursor sometimes, stretches out
Black shorthair: smaller, sleeps curled up, occasional tail twitch

Outfit System

Live2D models support model.json texture swapping
Phase 1: pre-defined color palettes/outfits that swap texture files
Outfits: Cozy Hoodie, Girly Dress, Pajama Set, Study Sweater, Going Out

Conversation Personality

System prompt defines: female, kind, encouraging, ADHD-aware, body double presence
Kira checks in: "How's it going? Need a timer? Want me to pick a scene?"
Encouragement: "You've got this! 15 minutes down." / "Time for a stretch break?"
Never judgmental. Always supportive.

Palette (Girly-Pop)

Primary bg:    #FFF5F5  (soft pink-white)
Accent pink:   #FFB6C1  (light pink)
Accent lav:    #D8B4FE  (lavender)
Accent mint:   #A7F3D0  (mint)
Text primary:  #4A1942  (deep plum)
Text soft:     #7C3AED  (violet)
Card bg:       #FFFFFF  (with soft shadow)
Highlight:     #FDF2F8  (pink glow)

Deployment

docker compose up -d

Frontend: http://kira.hobokenchicken.com (via Caddy)
Backend:  ws://kira.hobokenchicken.com/api/ws (WebSocket)

Runs on existing homelab stack. Add Caddy entry pointing to the frontend container.

8.1 KiB Raw Blame History