# LLM Proxy Gateway
A unified, high-performance LLM proxy gateway built in Rust. It provides a single OpenAI-compatible API to access multiple providers (OpenAI, Gemini, DeepSeek, Grok) with built-in token tracking, real-time cost calculation, and a management dashboard.
## 🚀 Features
- Unified API: Fully OpenAI-compatible `/v1/chat/completions` endpoint.
- Multi-Provider Support:
  - OpenAI: Standard models (GPT-4o, GPT-3.5, etc.) and reasoning models (o1, o3).
  - Google Gemini: Support for the latest Gemini 2.0 models.
  - DeepSeek: High-performance, low-cost integration.
  - xAI Grok: Integration for Grok-series models.
  - Ollama: Support for local LLMs running on your machine or another host.
- Observability & Tracking:
  - Real-time Costing: Fetches live pricing and context specs from models.dev on startup.
  - Token Counting: Precise estimation using `tiktoken-rs`.
  - Database Logging: Every request is logged to SQLite for historical analysis.
  - Streaming Support: Full SSE (Server-Sent Events) support with aggregated token tracking.
- Multimodal (Vision): Support for image processing (Base64 and remote URLs) across compatible providers.
- Reliability:
  - Circuit Breaking: Automatically protects your system when providers are down.
  - Rate Limiting: Granular per-client and global rate limits.
- Management Dashboard: A modern, real-time web interface to monitor usage, costs, and system health.
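The cost tracking above comes down to simple per-token arithmetic. A minimal sketch in Python (the rates below are invented for illustration; the proxy itself loads real per-million-token pricing from models.dev at startup):

```python
# Illustrative per-request cost calculation. The rates here are hypothetical
# placeholders, NOT real pricing; the proxy fetches live rates on startup.
PRICING_PER_MILLION = {  # USD per 1M tokens (made-up example values)
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request given token counts and rates."""
    rates = PRICING_PER_MILLION[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1_200, output_tokens=300)
print(f"${cost:.6f}")  # → $0.006000
```

The same formula is applied per request before the result is persisted to SQLite and surfaced in the dashboard totals.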
## 🛠️ Tech Stack
- Runtime: Rust (2024 Edition) with Tokio.
- Web Framework: Axum.
- Database: SQLx with SQLite.
- Frontend: Vanilla JS/CSS (no heavyweight framework required).
## 🚦 Getting Started
### Prerequisites
- Rust (1.80+)
- SQLite3
### Installation
1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/llm-proxy.git
   cd llm-proxy
   ```

2. Set up your environment:

   ```shell
   cp .env.example .env
   # Edit .env and add your API keys
   ```

3. Configure providers and server: Edit `config.toml` to customize models, pricing fallbacks, and port settings.

   Ollama Example (`config.toml`):

   ```toml
   [providers.ollama]
   enabled = true
   base_url = "http://192.168.1.50:11434/v1"
   models = ["llama3", "mistral"]
   ```

4. Run the proxy:

   ```shell
   cargo run --release
   ```
The server starts at `http://localhost:3000` by default.
## 📊 Management Dashboard
Access the built-in dashboard at http://localhost:3000 to see:
- Usage Summary: Total requests, tokens, and USD spent.
- Trend Charts: 24-hour request and cost distributions.
- Live Logs: Real-time stream of incoming LLM requests via WebSockets.
- Provider Health: Monitor which providers are online or degraded.
## 🔌 API Usage
The proxy is designed to be a drop-in replacement for OpenAI. Simply change your base URL:
Example Request (Python):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="your-proxy-client-id"  # Hashed sk- keys are managed in the dashboard
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from the proxy!"}]
)
```
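Streaming works through the same SDK. A sketch, assuming the client configured above: passing `stream_options={"include_usage": True}` asks for a final SSE chunk carrying token usage, which the proxy relies on for aggregated streaming token tracking.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="your-proxy-client-id")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Stream a haiku about proxies."}],
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries token counts
)

for chunk in stream:
    # Content deltas arrive chunk by chunk; the usage-only final chunk
    # has an empty choices list and a populated usage object.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n[tokens used: {chunk.usage.total_tokens}]")
```

This example requires the proxy to be running locally, so it is not self-verifying; the field names follow the standard OpenAI streaming response shape.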
## ⚖️ License
MIT OR Apache-2.0