Multi-Model API Gateway (LiteLLM + Ollama)
LiteLLM + Ollama = a unified API gateway serving local models and 100+ cloud providers. Load balance, track costs, set rate limits - all behind a single OpenAI-compatible endpoint.
Multi-Model API Gateway (LiteLLM + Ollama) is a local AI stack that unifies multiple local and cloud models behind a single OpenAI-compatible API. It combines 6 components, is rated advanced, and takes about 20 minutes to set up. Expect around $800 in hardware and $0/month versus cloud.
- Cost: ~$800 hardware, $0/mo vs cloud
- Difficulty: advanced
- Setup time: ~20 min
- Use case: Unify multiple local and cloud models behind a single OpenAI-compatible API
A unified API gateway for all your AI models. LiteLLM is a Python SDK and proxy server that exposes 100+ LLM providers (OpenAI, Anthropic, Google, and local models) through a single OpenAI-compatible API. Connect it to Ollama for local inference, and you get a centralized gateway with load balancing, cost tracking, rate limiting, and fallback routing - all running on your own hardware.
What you get
- Unified API - one OpenAI-compatible endpoint for all models, local and cloud
- Load balancing - distribute requests across multiple instances of the same model
- Fallback routing - if one model fails, automatically try another
- Cost tracking - per-model, per-user, per-API-key spend tracking
- Rate limiting - control requests per second per user or API key
- Model access control - restrict which users can access which models
- Logging & auditing - full request/response logs with latency and token counts
- $0/mo - the gateway itself is free; pay only for cloud API usage if you use it
Architecture
| Component | Role |
|---|---|
| LiteLLM | Proxy server - routing, load balancing, cost tracking |
| Ollama | Serves local models for low-latency, private inference |
| Qwen3 30B A3B | Fast MoE model - 3B active params, great for most tasks |
| PostgreSQL (optional) | Persistent logging and spend tracking |
For local inference, recommended GPU: RTX 3090 24GB or RTX 4090 24GB. The gateway itself runs on any machine with Python.
Prerequisites
- Python 3.9+
- Ollama installed and running
- 2 GB RAM for the proxy (more for logging)
- Optional: PostgreSQL for persistent spend tracking
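The prerequisites above can be checked with a short script before installing anything. This is a minimal sketch, assuming Ollama listens on its default address `http://localhost:11434`; the helper names `python_ok` and `ollama_models` are illustrative, not part of LiteLLM or Ollama. It queries Ollama's `/api/tags` endpoint, which lists the locally installed models.

```python
import json
import sys
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def python_ok(version=sys.version_info, minimum=(3, 9)):
    """Return True if the interpreter meets the Python 3.9+ prerequisite."""
    return tuple(version[:2]) >= minimum


def ollama_models(base_url=OLLAMA_URL, timeout=3.0):
    """Return the model names Ollama reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    return [m["name"] for m in payload.get("models", [])]
```

Calling `ollama_models()` returns a list such as the names shown by `ollama list` when the server is up, and `None` when it is not, so the two cases are easy to tell apart.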
Setup
Step 1: Install LiteLLM
pip install 'litellm[proxy]'

Step 2: Configure the Proxy
Create a config.yaml:
model_list:
  # Local Ollama models
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      rpm: 30
  - model_name: qwen3-moe-local
    litellm_params:
      model: ollama/qwen3:30b-a3b
      api_base: http://localhost:11434
      rpm: 20
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
      rpm: 60
  # Fallback cloud models (optional - add API keys)
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis  # Needs a reachable Redis server; remove the cache lines if you don't run one
    ttl: 3600
general_settings:
  master_key: sk-your-master-key  # Required for admin UI
  database_url: os.environ/DATABASE_URL  # Optional PostgreSQL

Step 3: Start the Proxy
litellm --config config.yaml --port 4000

The gateway is now running at http://localhost:4000 with an OpenAI-compatible API.
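One way to confirm the proxy is serving the configured models is to ask it for its model list. The sketch below assumes the proxy exposes the OpenAI-style `/v1/models` route at `http://localhost:4000` and accepts the master key from `config.yaml` as a bearer token; the function names are illustrative.

```python
import json
import urllib.request


def build_models_request(base_url="http://localhost:4000",
                         api_key="sk-your-master-key"):
    """Build a GET request for the gateway's OpenAI-style model list."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def list_gateway_models(base_url="http://localhost:4000",
                        api_key="sk-your-master-key", timeout=5.0):
    """Return the model names the gateway advertises (needs a running proxy)."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-style list responses carry the entries under "data".
    return [entry["id"] for entry in payload.get("data", [])]
```

With the config above, `list_gateway_models()` should include names such as `qwen3-local` and `llama3-local`; a missing name usually points at a typo in `model_list`.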
Step 4: Connect Any OpenAI-Compatible Tool
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-master-key",
    base_url="http://localhost:4000/v1"
)

# Use local model
response = client.chat.completions.create(
    model="qwen3-local",
    messages=[{"role": "user", "content": "Hello!"}]
)

Use it
Smart Fallback Routing
LiteLLM can automatically fall back to a different model if the primary one fails:
router_settings:
  routing_strategy: latency-based-routing
  # fallbacks is a list of {primary: [fallback, ...]} mappings
  fallbacks:
    - qwen3-local: [gpt-4o-mini, claude-sonnet]

If your local Ollama goes down, requests automatically route to the cloud fallback.
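The fallback behaviour can be pictured as trying an ordered chain of models until one answers. The following is a conceptual sketch only, not LiteLLM's implementation: `fallback_chain`, `call_with_fallback`, and the `send` callback are hypothetical names used to illustrate the idea.

```python
from typing import Callable, Dict, List, Tuple

# Mirrors the fallbacks entry from the config above.
FALLBACKS: Dict[str, List[str]] = {
    "qwen3-local": ["gpt-4o-mini", "claude-sonnet"],
}


def fallback_chain(model: str, fallbacks: Dict[str, List[str]] = FALLBACKS) -> List[str]:
    """Return the primary model followed by its configured fallbacks."""
    return [model, *fallbacks.get(model, [])]


def call_with_fallback(model: str, send: Callable[[str], str],
                       fallbacks: Dict[str, List[str]] = FALLBACKS) -> Tuple[str, str]:
    """Try each model in the chain; return (model_used, reply) from the first success."""
    errors = []
    for candidate in fallback_chain(model, fallbacks):
        try:
            return candidate, send(candidate)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{candidate}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Here `send` stands in for whatever actually issues the request. A model with no fallback entry simply yields a one-element chain, so its failures surface directly.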
Load Balancing Across GPU Instances
Run Ollama on multiple machines and balance across them:
model_list:
  # Two deployments share one model_name, so requests are balanced across them
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://ollama-node-1:11434
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://ollama-node-2:11434

Track Spend by User
general_settings:
  max_budget: 100  # $100 total budget
  budget_duration: 30d

Assign budgets to API keys or users in the LiteLLM admin UI at http://localhost:4000/ui.
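Budgets can also be attached to individual keys from a script rather than the UI. The sketch below builds a request for the proxy's `/key/generate` endpoint. It assumes the proxy runs at `http://localhost:4000`, that a database is configured (virtual keys are stored there), and that the `models`, `max_budget`, and `budget_duration` fields are accepted as shown, so check the field names against your LiteLLM version. `build_key_request` is an illustrative helper name.

```python
import json
import urllib.request


def build_key_request(models, max_budget, budget_duration="30d",
                      base_url="http://localhost:4000",
                      master_key="sk-your-master-key"):
    """Build a POST request asking the proxy for a budget-limited API key."""
    body = {
        "models": list(models),              # models this key may call
        "max_budget": max_budget,            # spend cap in USD
        "budget_duration": budget_duration,  # how often the budget resets
    }
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(build_key_request(["qwen3-local"], 10))` against a running proxy should return JSON containing the new key. The builder itself performs no network access.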
Cost vs cloud
| | Local LiteLLM + Ollama | Direct Cloud API Access |
|---|---|---|
| Monthly (proxy) | $0 | $0 |
| Local inference | $0 | N/A |
| Cloud usage (opt) | Pay-as-you-go | Pay-as-you-go |
| Model switching | Single endpoint | Multiple SDKs/keys |
| Rate limiting | Built-in | Manual |
| Cost visibility | Per-model dashboard | Separate bills |
| Failover | Automatic | Manual |
The gateway doesn't add cost - it helps you reduce it by preferring local models and only falling back to cloud APIs when needed.
Troubleshooting
- Proxy won't start → Check the config YAML syntax. Run litellm --config config.yaml --debug for verbose logs.
- Ollama models not found → Verify ollama list shows the model, and the name matches exactly.
- Slow first request → LiteLLM caches the model list on startup. First request may be slow while it connects to all providers.
- Rate limit errors → Adjust rpm (requests per minute) in the config to match your GPU's capacity.
- CORS errors → Add --cors-origins http://localhost:3000 to the startup command.
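For the "Ollama models not found" case, the name comparison can be automated. This is a small sketch under the assumption that local entries use the `ollama/` (or `ollama_chat/`) prefix followed by the exact tag that `ollama list` prints; `ollama_tag` and `missing_models` are illustrative helper names.

```python
def ollama_tag(litellm_model):
    """Map 'ollama/qwen3:14b' to 'qwen3:14b'; return None for non-Ollama entries."""
    prefix, _, tag = litellm_model.partition("/")
    if prefix in ("ollama", "ollama_chat") and tag:
        return tag
    return None


def missing_models(config_models, installed):
    """Return the config entries whose Ollama tag is not installed locally."""
    installed_set = set(installed)
    missing = []
    for entry in config_models:
        tag = ollama_tag(entry)
        if tag is not None and tag not in installed_set:
            missing.append(entry)
    return missing
```

Pass the `model:` values from `config.yaml` as `config_models` and the names from `ollama list` as `installed`. Anything returned needs an `ollama pull` or a corrected name, while cloud entries such as `openai/gpt-4o-mini` are ignored.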
Swap components
- Skip cloud fallback → Remove the OpenAI/Anthropic entries from config for a 100% local gateway.
- Use PostgreSQL → Set DATABASE_URL for persistent spend logs and user management.
- Add more local backends → LiteLLM supports vLLM, LM Studio, and TGI as backends alongside Ollama.
- Containerized → Run ghcr.io/berriai/litellm:main-latest as a Docker container.
Frequently asked
What is the Multi-Model API Gateway (LiteLLM + Ollama) stack for?
It unifies multiple local and cloud models behind a single OpenAI-compatible API, so you can load balance, track costs, and set rate limits in one place. The gateway and the local models run entirely on your own hardware; cloud providers are optional.
How much does the Multi-Model API Gateway (LiteLLM + Ollama) stack cost?
Multi-Model API Gateway (LiteLLM + Ollama) costs around $800 in hardware up front and $0/month to run, since the gateway and local models are self-hosted with no per-token or subscription fees. Cloud fallback models, if you enable them, are billed pay-as-you-go by their providers.
How long does it take to set up Multi-Model API Gateway (LiteLLM + Ollama)?
Plan for roughly 20 minutes. The stack is rated advanced.
What do I need to run Multi-Model API Gateway (LiteLLM + Ollama)?
Multi-Model API Gateway (LiteLLM + Ollama) is built from 2 tools, 2 models, and 2 hardware items. Each is listed below with a link.