- Sequential next_tool_index is now used for both Responses API 'function_call' events and the proxy's 'tool_uses' JSON extraction.
- This ensures tool_calls arrays in the stream always start at index 0 and are dense, even if standard and embedded calls were somehow mixed.
- Fixed 'payload_idx' logic to correctly align argument chunks with their initialization chunks.
- The OpenAI Responses API uses 'output_index' to identify items in the response.
- If a response starts with text (output_index 0) followed by a tool call (output_index 1), the standard Chat Completions streaming format requires the first tool call to have index 0.
- Previously, the proxy was passing output_index (1) as the tool_call index, causing client-side SDKs to fail parsing the stream and silently drop the tool calls.
- Implemented a local mapping within the stream to ensure tool_call indexes are always dense and start at 0.
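A minimal sketch of that mapping, assuming a HashMap keyed by the provider's output_index (the type and field names here are illustrative, not the actual code):

```rust
use std::collections::HashMap;

/// Maps the Responses API's sparse `output_index` values to dense,
/// zero-based tool_call indexes for the Chat Completions stream.
struct ToolIndexMap {
    map: HashMap<u64, usize>,
    next_tool_index: usize,
}

impl ToolIndexMap {
    fn new() -> Self {
        Self { map: HashMap::new(), next_tool_index: 0 }
    }

    /// Returns the dense index for a given output_index, allocating the
    /// next sequential slot the first time that output_index is seen.
    fn dense_index(&mut self, output_index: u64) -> usize {
        if let Some(&idx) = self.map.get(&output_index) {
            return idx;
        }
        let idx = self.next_tool_index;
        self.next_tool_index += 1;
        self.map.insert(output_index, idx);
        idx
    }
}
```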
- Standard OpenAI clients expect tool_calls to be streamed as two parts:
1. Initialization chunk containing 'id', 'type', and 'name', with empty 'arguments'.
2. Payload chunk(s) containing 'arguments', with 'id', 'type', and 'name' omitted.
- Previously, the proxy was yielding all fields in a single chunk when parsing the custom 'tool_uses' JSON from gpt-5.4, causing strict clients like opencode to fail silently when delegating parallel tasks.
- The proxy now splits the extracted JSON into the correct two-chunk sequence, restoring subagent compatibility.
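For reference, a sketch of the two-chunk sequence now emitted for a single streamed tool call (chunk envelopes abbreviated; the id and arguments are placeholders):

```rust
use serde_json::{json, Value};

// Illustrative: the two chunks emitted for one streamed tool call.
fn tool_call_chunks() -> (Value, Value) {
    // 1. Initialization chunk: id/type/name present, arguments empty.
    let init = json!({
        "choices": [{
            "index": 0,
            "delta": { "tool_calls": [{
                "index": 0,
                "id": "call_abc123",            // illustrative id
                "type": "function",
                "function": { "name": "task", "arguments": "" }
            }]},
            "finish_reason": null
        }]
    });
    // 2. Payload chunk(s): arguments only; id/type/name omitted.
    let payload = json!({
        "choices": [{
            "index": 0,
            "delta": { "tool_calls": [{
                "index": 0,
                "function": { "arguments": "{\"description\":\"example\"}" }
            }]},
            "finish_reason": null
        }]
    });
    (init, payload)
}
```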
- Remove overzealous .trim() in strip_internal_metadata that destroyed whitespace between text stream chunks and caused client hangs
- Fix finish_reason logic to only yield once at the end of the stream
- Correctly yield finish_reason: 'tool_calls' instead of 'stop' when tool calls are generated
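Roughly, the terminal chunk now looks like this (a sketch; the state tracking lives in the stream adapter and the names here are illustrative):

```rust
use serde_json::{json, Value};

// Emitted exactly once, after the last content or tool-call delta.
fn final_chunk(saw_tool_calls: bool) -> Value {
    let finish_reason = if saw_tool_calls { "tool_calls" } else { "stop" };
    json!({
        "choices": [{ "index": 0, "delta": {}, "finish_reason": finish_reason }]
    })
}
```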
- The Responses API does not use 'response.item.delta' for tool calls.
- It uses 'response.output_item.added' to initialize the function call.
- It uses 'response.function_call_arguments.delta' for the payload stream.
- Updated the streaming parser to catch these events and correctly yield ToolCallDelta objects.
- This restores proper streaming of tool calls back to the client.
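A sketch of the event dispatch, assuming the parser matches on the SSE event's type string (the yield steps are shown as comments because the proxy's delta constructors are not reproduced here):

```rust
use serde_json::Value;

// Illustrative dispatch over Responses API streaming events.
fn handle_event(event_type: &str, event: &Value) {
    match event_type {
        // Initializes the function call: carries the call id and function name.
        "response.output_item.added" if event["item"]["type"] == "function_call" => {
            // yield a ToolCallDelta with id, name, and empty arguments
        }
        // Streams the JSON arguments for the call initialized above.
        "response.function_call_arguments.delta" => {
            // yield a ToolCallDelta carrying the argument fragment from event["delta"]
        }
        // Plain text deltas pass through as before.
        "response.output_text.delta" => {
            // yield a content delta
        }
        _ => {}
    }
}
```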
- The Responses API ends streams without a final '[DONE]' message.
- This causes reqwest_eventsource to return Error::StreamEnded.
- Previously, this was treated as a premature termination, triggering an error probe.
- We now explicitly match and break on Err(StreamEnded) for normal completion.
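A sketch of the consumption loop, assuming reqwest_eventsource's EventSource stream (error handling abbreviated):

```rust
use futures_util::StreamExt;
use reqwest_eventsource::{Error as EsError, Event, EventSource};

// The Responses API closes the connection without a terminal [DONE],
// which surfaces as Error::StreamEnded; treat that as normal completion.
async fn consume(mut es: EventSource) {
    while let Some(ev) = es.next().await {
        match ev {
            Ok(Event::Open) => {}
            Ok(Event::Message(msg)) => {
                // parse msg.data as a Responses API event
                let _ = msg.data;
            }
            // Normal completion: the server simply ended the stream.
            Err(EsError::StreamEnded) => break,
            Err(e) => {
                // Genuine error: log and run the error probe.
                eprintln!("stream error: {e}");
                break;
            }
        }
    }
}
```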
- The OpenAI Responses API actually requires the 'stream: true'
parameter in the JSON body, contrary to some documentation summaries.
- Omitting it caused the API to return a full application/json
response instead of SSE text/event-stream, leading to stream failures
and probe warnings in the proxy logs.
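For reference, a minimal sketch of the streaming request body (the model id and input item are placeholders):

```rust
use serde_json::json;

fn build_streaming_body() -> serde_json::Value {
    // The Responses API returns a plain application/json body unless
    // `stream: true` is present, so the streaming path always sets it.
    json!({
        "model": "gpt-5",  // illustrative model id
        "input": [
            { "role": "user", "content": "Hello" }
        ],
        "stream": true
    })
}
```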
- Implement final buffer flush in streaming path to prevent data loss
- Increase probe response body logging to 500 characters
- Ensure internal metadata is stripped even on final flush
- Fix potential hang when stream ends without explicit [DONE] event
- Add strip_internal_metadata helper to remove prefixes like 'to=multi_tool_use.parallel'
- Clean up Thai text preambles reported in the journal
- Apply metadata stripping to both synchronous and streaming response paths
- Improve visual quality of proxied model responses
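A sketch of what the helper might look like; only the 'to=multi_tool_use.parallel' prefix is taken from this change, and the rest of the prefix handling is an assumption:

```rust
/// Removes internal routing/metadata prefixes that some models emit at the
/// start of their text, without trimming legitimate whitespace between
/// stream chunks. The prefix list here is illustrative.
fn strip_internal_metadata(text: &str) -> String {
    const PREFIXES: &[&str] = &["to=multi_tool_use.parallel"];
    let mut out = text;
    for p in PREFIXES {
        if let Some(rest) = out.strip_prefix(p) {
            out = rest;
        }
    }
    out.to_string()
}
```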
- Implement cross-delta content buffering in streaming Responses API
- Wait for full 'tool_uses' JSON block before yielding to client
- Handle 'to=multi_tool_use.parallel' preamble by buffering
- Fix stream error probe to not request a new stream
- Remove raw JSON leakage from streaming content
- Add static parse_tool_uses_json helper to extract embedded tool calls
- Update synchronous and streaming Responses API parsers to detect tool_uses blocks
- Strip tool_uses JSON from content to prevent raw JSON leakage to client
- Resolve lifetime issues by avoiding &self capture in streaming closure
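A simplified sketch of the extraction helper; the embedded shape ({"tool_uses": [...]}) follows the entries above, while the parsing strategy and return type here are assumptions:

```rust
use serde_json::Value;

/// Illustrative extraction of an embedded `tool_uses` block from model text.
/// Returns the parsed tool uses and the visible content with the JSON removed.
fn parse_tool_uses_json(content: &str) -> Option<(Vec<Value>, String)> {
    // Find the start of the embedded JSON object.
    let start = content.find('{')?;
    let parsed: Value = serde_json::from_str(content[start..].trim()).ok()?;
    let uses = parsed.get("tool_uses")?.as_array()?.clone();
    // Strip the JSON block from the content returned to the client.
    let cleaned = content[..start].to_string();
    Some((uses, cleaned))
}
```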
- Add 'tools' and 'tool_choice' parameters to streaming Responses API
- Include 'name' field in message items for Responses API input
- Use string content for text-only messages to improve instruction following
- Fix subagents not triggering and files not being created
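A sketch of the streaming request body with the added fields; the tool definition is a placeholder and the exact field layout is an assumption, not the proxy's literal output:

```rust
use serde_json::{json, Value};

fn build_body_with_tools() -> Value {
    json!({
        "model": "gpt-5",
        "input": [
            { "role": "user", "content": "Create the file" }
        ],
        "tools": [{
            "type": "function",
            "name": "write_file",
            "description": "Write a file to disk",
            "parameters": {
                "type": "object",
                "properties": { "path": { "type": "string" } },
                "required": ["path"]
            }
        }],
        "tool_choice": "auto",
        "stream": true
    })
}
```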
Updated the OpenAI Responses API requests to use a structured input format (array of objects) for better compatibility. Added a proactive error probe to chat_responses_stream to capture and log API error bodies on failure.
This commit adds support for the OpenAI Responses API in both streaming and non-streaming modes. It also implements proactive routing for gpt-5 and codex models and cleans up unused 'session' variable warnings across the dashboard source files.
Newer OpenAI models (o1, o3, gpt-5) have deprecated 'max_tokens' in favor of
'max_completion_tokens'. The provider now automatically maps this parameter
to ensure compatibility and avoid 400 errors.
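A sketch of the mapping, assuming the request body is built as a serde_json::Value (the model-prefix check is illustrative):

```rust
use serde_json::Value;

/// Newer models reject `max_tokens`; move the value to
/// `max_completion_tokens` when the target model requires it.
fn map_max_tokens(body: &mut Value, model: &str) {
    let needs_new_param = model.starts_with("o1")
        || model.starts_with("o3")
        || model.starts_with("gpt-5");
    if needs_new_param {
        if let Some(v) = body.as_object_mut().and_then(|o| o.remove("max_tokens")) {
            body["max_completion_tokens"] = v;
        }
    }
}
```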
Track cache_read_tokens and cache_write_tokens end-to-end: parse from
provider responses (OpenAI, DeepSeek, Grok, Gemini), persist to SQLite,
apply cache-aware pricing from the model registry, and surface in API
responses and the dashboard.
- Add cache fields to ProviderResponse, StreamUsage, RequestLog structs
- Parse cached_tokens (OpenAI/Grok), prompt_cache_hit/miss (DeepSeek),
cachedContentTokenCount (Gemini) from provider responses
- Send stream_options.include_usage for streaming; capture real usage
from final SSE chunk in AggregatingStream
- ALTER TABLE migration for cache_read_tokens/cache_write_tokens columns
- Cache-aware cost formula using registry cache_read/cache_write rates
- Update Provider trait calculate_cost signature across all providers
- Add cache_read_tokens/cache_write_tokens to Usage API response
- Dashboard: cache hit rate card, cache columns in pricing and usage
tables, cache token aggregation in SQL queries
- Remove API debug panel and verbose console logging from api.js
- Bump static asset cache-bust to v5
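A sketch of the cache-aware cost formula mentioned above, with per-1M-token rates from the model registry; the struct layout and the assumption that cached tokens are counted inside prompt_tokens are illustrative:

```rust
/// Illustrative cache-aware cost calculation using the registry's
/// cache_read / cache_write rates (all rates per 1M tokens).
struct ModelRates {
    input: f64,
    output: f64,
    cache_read: f64,
    cache_write: f64,
}

fn calculate_cost(
    rates: &ModelRates,
    prompt_tokens: u64,
    completion_tokens: u64,
    cache_read_tokens: u64,
    cache_write_tokens: u64,
) -> f64 {
    // Cached reads/writes are assumed to be included in prompt_tokens,
    // so they are priced separately and removed from the regular input total.
    let uncached_input = prompt_tokens.saturating_sub(cache_read_tokens + cache_write_tokens);
    (uncached_input as f64 * rates.input
        + cache_read_tokens as f64 * rates.cache_read
        + cache_write_tokens as f64 * rates.cache_write
        + completion_tokens as f64 * rates.output)
        / 1_000_000.0
}
```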
- Add in-memory ModelConfigCache (30s refresh, explicit invalidation)
replacing 2 SQLite queries per request (model lookup + cost override)
- Configure all 5 provider HTTP clients with proper timeouts (300s),
connection pooling (4 idle/host, 90s idle timeout), and TCP keepalive
- Move client_usage update to tokio::spawn in non-streaming path
- Use fast chars/4 heuristic for token estimation on large inputs (>1KB)
- Generate single UUID/timestamp per SSE stream instead of per chunk
- Add shared LazyLock<Client> for image fetching in multimodal module
- Add proxy overhead timing instrumentation for both request paths
- Fix test helper to include new model_config_cache field
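A minimal sketch of the shared-client configuration described above, using the stated timeout and pooling values (the keepalive interval is an assumption):

```rust
use std::sync::LazyLock;
use std::time::Duration;

// Shared reqwest client: 300s request timeout, 4 idle connections per host,
// 90s idle timeout, TCP keepalive.
static HTTP_CLIENT: LazyLock<reqwest::Client> = LazyLock::new(|| {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(300))
        .pool_max_idle_per_host(4)
        .pool_idle_timeout(Duration::from_secs(90))
        .tcp_keepalive(Duration::from_secs(60)) // interval is illustrative
        .build()
        .expect("failed to build HTTP client")
});
```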
Implement full OpenAI-compatible tool-calling support across the proxy,
enabling OpenCode to use llm-proxy as its sole LLM backend.
- Add 9 tool-calling types (Tool, FunctionDef, ToolChoice, ToolCall, etc.)
- Update ChatCompletionRequest/ChatMessage/ChatStreamDelta with tool fields
- Update UnifiedRequest/UnifiedMessage to carry tool data through the pipeline
- Shared helpers: messages_to_openai_json handles tool messages, build_openai_body
includes tools/tool_choice, parse/stream extract tool_calls from responses
- Gemini: full OpenAI<->Gemini format translation (functionDeclarations,
functionCall/functionResponse, synthetic call IDs, tool_config mapping)
- Gemini: extract duplicated message-conversion into shared convert_messages()
- Server: SSE streams include tool_calls deltas, finish_reason='tool_calls'
- AggregatingStream: accumulate tool call deltas across stream chunks
- OpenAI provider: add o4- prefix to supports_model()
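For orientation, a sketch of a few of the tool-calling types listed above; field names follow the OpenAI wire format, and the real definitions likely carry additional serde attributes and variants:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct FunctionDef {
    pub name: String,
    pub description: Option<String>,
    pub parameters: serde_json::Value, // JSON Schema for the arguments
}

#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct Tool {
    #[serde(rename = "type")]
    pub tool_type: String, // "function"
    pub function: FunctionDef,
}

#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct FunctionCall {
    pub name: String,
    pub arguments: String, // JSON-encoded arguments
}

#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct ToolCall {
    pub id: String,
    #[serde(rename = "type")]
    pub call_type: String, // "function"
    pub function: FunctionCall,
}
```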
- Added 'provider_configs' and 'model_configs' tables to database.
- Refactored ProviderManager to support thread-safe dynamic updates and database overrides.
- Implemented 'Models' tab in dashboard to manage model visibility, mapping, and pricing.
- Added provider configuration modal to 'Providers' tab.
- Integrated database overrides into chat completion logic (enabled state, mapping, and cost).
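A sketch of how a per-model override from model_configs might be applied during chat completion routing; the field and function names are assumptions beyond the enabled/mapping/cost concepts listed above:

```rust
/// Illustrative per-model override loaded from the model_configs table.
struct ModelOverride {
    enabled: bool,
    mapped_model: Option<String>,  // e.g. expose one name but call another
    input_cost_per_m: Option<f64>, // pricing overrides, per 1M tokens
    output_cost_per_m: Option<f64>,
}

fn resolve_model(requested: &str, overrides: Option<&ModelOverride>) -> Result<String, String> {
    match overrides {
        // Disabled models are rejected before hitting a provider.
        Some(cfg) if !cfg.enabled => Err(format!("model '{requested}' is disabled")),
        // Enabled models are rewritten to their mapped name if one is set.
        Some(cfg) => Ok(cfg.mapped_model.clone().unwrap_or_else(|| requested.to_string())),
        None => Ok(requested.to_string()),
    }
}
```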