Multi-Model API Gateway (LiteLLM + Ollama)

LiteLLM + Ollama = a unified API gateway serving local models and 100+ cloud providers. Load balance, track costs, set rate limits - all behind a single OpenAI-compatible endpoint.

The short answer

Multi-Model API Gateway (LiteLLM + Ollama) is a local AI stack that unifies multiple local and cloud models behind a single OpenAI-compatible API, with load balancing, cost tracking, and rate limiting. It combines 6 components, is rated advanced, and takes about 20 minutes to set up. Expect around $800 in hardware and $0/month for the gateway and local inference; cloud usage, if you enable it, is billed by the provider.

Cost: ~$800 hardware, $0/mo vs cloud
Difficulty: advanced
Setup time: ~20 min
Use case: Unify multiple local and cloud models behind a single OpenAI-compatible API

A unified API gateway for all your AI models. LiteLLM is a Python SDK and proxy server that exposes 100+ LLM providers (OpenAI, Anthropic, Google, and local models) through a single OpenAI-compatible API. Connect it to Ollama for local inference, and you get a centralized gateway with load balancing, cost tracking, rate limiting, and fallback routing - all running on your own hardware.

What you get

  • Unified API - one OpenAI-compatible endpoint for all models, local and cloud
  • Load balancing - distribute requests across multiple instances of the same model
  • Fallback routing - if one model fails, automatically try another
  • Cost tracking - per-model, per-user, per-API-key spend tracking
  • Rate limiting - cap requests per minute (rpm) per model, user, or API key
  • Model access control - restrict which users can access which models
  • Logging & auditing - full request/response logs with latency and token counts
  • $0/mo - the gateway itself is free; pay only for cloud API usage if you use it

Architecture

Component               Role
LiteLLM                 Proxy server - routing, load balancing, cost tracking
Ollama                  Serves local models for low-latency, private inference
Qwen3 30B A3B           Fast MoE model - 3B active params, great for most tasks
PostgreSQL (optional)   Persistent logging and spend tracking

For local inference, recommended GPU: RTX 3090 24GB or RTX 4090 24GB. The gateway itself runs on any machine with Python.

Prerequisites

  • Python 3.9+
  • Ollama installed and running
  • 2 GB RAM for the proxy (more for logging)
  • Optional: PostgreSQL for persistent spend tracking

Setup

Step 1: Install LiteLLM

pip install 'litellm[proxy]'

Step 2: Configure the Proxy

Create a config.yaml:

model_list:
  # Local Ollama models
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      rpm: 30
  - model_name: qwen3-moe-local
    litellm_params:
      model: ollama/qwen3:30b-a3b
      api_base: http://localhost:11434
      rpm: 20
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
      rpm: 60
 
  # Fallback cloud models (optional - add API keys)
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
 
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
 
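litellm_settings:
  drop_params: true      # Drop request params a provider does not support
  set_verbose: false
  cache: true            # Response caching; remove cache and cache_params if you have no Redis
  cache_params:
    type: redis          # Needs a reachable Redis server (not listed in the prerequisites)
    ttl: 3600            # Cached responses expire after 1 hour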
 
general_settings:
  master_key: sk-your-master-key  # Required for admin UI
  database_url: os.environ/DATABASE_URL  # Optional PostgreSQL

Step 3: Start the Proxy

litellm --config config.yaml --port 4000

The gateway is now running at http://localhost:4000 with an OpenAI-compatible API.
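
A quick way to confirm the proxy is up is to ask it which models it exposes. The sketch below assumes the gateway serves the OpenAI-style model listing at /v1/models and that the placeholder master key from config.yaml is in use; the function names are illustrative and not part of LiteLLM.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:4000"  # address from Step 3
MASTER_KEY = "sk-your-master-key"      # placeholder key from config.yaml


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the OpenAI-style model listing endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model names from a body shaped like {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_gateway_models(base_url: str = GATEWAY_URL,
                        api_key: str = MASTER_KEY) -> list[str]:
    """Call the running proxy and return the model names it exposes."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_model_ids(json.load(response))
```

With the proxy running, list_gateway_models() should return the model_name values from config.yaml, such as qwen3-local and gpt-4o-mini.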

Step 4: Connect Any OpenAI-Compatible Tool

from openai import OpenAI
 
client = OpenAI(
    api_key="sk-your-master-key",
    base_url="http://localhost:4000/v1"
)
 
# Use local model
response = client.chat.completions.create(
    model="qwen3-local",
    messages=[{"role": "user", "content": "Hello!"}]
)
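
Because every model sits behind the same endpoint, switching between local and cloud is only a change of the model string. The helper below is a small sketch of that idea; ask is a hypothetical name, and it assumes the client is the OpenAI object created above (or anything with the same chat.completions.create interface).

```python
def ask(client, model: str, prompt: str) -> str:
    """Send one user message to the given gateway model and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example usage against the running gateway (model names come from config.yaml):
#   ask(client, "qwen3-local", "Summarize this ticket ...")   # local, via Ollama
#   ask(client, "gpt-4o-mini", "Summarize this ticket ...")   # cloud, needs OPENAI_API_KEY
```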

Use it

Smart Fallback Routing

LiteLLM can automatically fall back to a different model if the primary one fails:

router_settings:
  routing_strategy: latency-based-routing
  fallbacks:
    # List of mappings: primary model name -> models to try next, in order
    - qwen3-local: [gpt-4o-mini, claude-sonnet]

If your local Ollama goes down, requests automatically route to the cloud fallback.
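
To make the behavior concrete, here is a minimal sketch of the try-primary-then-fallbacks pattern. It illustrates the idea only and is not LiteLLM's implementation; call_with_fallbacks and the send callable are hypothetical names, where send stands for any function that sends a prompt to a named model and raises on failure.

```python
from typing import Callable, Sequence


def call_with_fallbacks(
    send: Callable[[str, str], str],
    primary: str,
    fallbacks: Sequence[str],
    prompt: str,
) -> tuple[str, str]:
    """Try the primary model, then each fallback in order.

    Returns (model_used, reply). Raises RuntimeError if every model fails.
    """
    last_error = None
    for model in [primary, *fallbacks]:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # any failure moves on to the next model
            last_error = exc
    raise RuntimeError(f"all models failed for prompt {prompt!r}") from last_error
```

With the config above, the order would be qwen3-local, then gpt-4o-mini, then claude-sonnet. The proxy does this server-side, so clients do not need this logic themselves.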

Load Balancing Across GPU Instances

Run Ollama on multiple machines and balance across them:

model_list:
  # Two deployments share one model_name, so requests are spread across them
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://ollama-node-1:11434
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://ollama-node-2:11434
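
Entries that share a model_name form one pool of deployments. The sketch below shows the grouping idea with plain round-robin, which is an assumption chosen for clarity; LiteLLM's own routing strategies (for example the latency-based one configured earlier) pick deployments differently. build_rotations is a hypothetical helper.

```python
from collections import defaultdict
from itertools import cycle
from typing import Iterator


def build_rotations(model_list: list[dict]) -> dict[str, Iterator[str]]:
    """Group deployments by model_name and return a round-robin iterator per group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in model_list:
        groups[entry["model_name"]].append(entry["litellm_params"]["api_base"])
    return {name: cycle(bases) for name, bases in groups.items()}


# The two deployments from the config above, written as Python dicts
MODEL_LIST = [
    {"model_name": "qwen3-local",
     "litellm_params": {"model": "ollama/qwen3:14b",
                        "api_base": "http://ollama-node-1:11434"}},
    {"model_name": "qwen3-local",
     "litellm_params": {"model": "ollama/qwen3:14b",
                        "api_base": "http://ollama-node-2:11434"}},
]

rotations = build_rotations(MODEL_LIST)
# Each next(rotations["qwen3-local"]) alternates between node-1 and node-2.
```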

Track Spend by User

litellm_settings:
  max_budget: 100       # $100 total budget for the whole proxy
  budget_duration: 30d  # Budget resets every 30 days

Assign budgets to API keys or users in the LiteLLM admin UI at http://localhost:4000/ui. Per-key and per-user spend tracking needs the PostgreSQL database from the config.
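
Budgets can also be attached to individual keys through the proxy's key management API instead of the UI. The sketch below builds such a request; it assumes the /key/generate endpoint with models, max_budget, and budget_duration fields, and a configured database, so check those names against the LiteLLM docs for your version. build_key_request is a hypothetical helper.

```python
import json
import urllib.request


def build_key_request(
    base_url: str,
    master_key: str,
    models: list[str],
    max_budget: float,
    budget_duration: str = "30d",
) -> urllib.request.Request:
    """Build a POST request asking the proxy for a new key with its own budget."""
    body = {
        "models": list(models),              # models this key may call
        "max_budget": max_budget,            # spend cap in USD
        "budget_duration": budget_duration,  # how often the budget resets
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it against a running proxy:
#   with urllib.request.urlopen(request, timeout=10) as resp:
#       new_key = json.load(resp)
```

Restricting the models list per key is also how the model access control from the feature list would typically be applied.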

Cost vs cloud

                         Local LiteLLM + Ollama   Direct Cloud API Access
Monthly (proxy)          $0                       $0
Local inference          $0                       N/A
Cloud usage (optional)   Pay-as-you-go            Pay-as-you-go
Model switching          Single endpoint          Multiple SDKs/keys
Rate limiting            Built-in                 Manual
Cost visibility          Per-model dashboard      Separate bills
Failover                 Automatic                Manual

The gateway doesn't add cost - it helps you reduce it by preferring local models and only falling back to cloud APIs when needed.
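
The saving depends on how much traffic stays local. The function below is a back-of-the-envelope sketch of that arithmetic; the request volume, token count, and price in the usage example are made-up illustrative numbers, not real provider pricing.

```python
def monthly_cloud_cost(
    requests_per_month: int,
    avg_tokens_per_request: int,
    usd_per_million_tokens: float,
    local_fraction: float,
) -> float:
    """Estimate the monthly cloud bill when a share of requests is served locally.

    local_fraction is the share of requests (0.0 to 1.0) handled by Ollama,
    which costs nothing per token.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0.0 and 1.0")
    total_tokens = requests_per_month * avg_tokens_per_request
    cloud_tokens = total_tokens * (1.0 - local_fraction)
    return cloud_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 100,000 requests/month, 1,000 tokens each, $1 per million tokens
all_cloud = monthly_cloud_cost(100_000, 1_000, 1.0, local_fraction=0.0)
mostly_local = monthly_cloud_cost(100_000, 1_000, 1.0, local_fraction=0.9)
```

Under these assumed numbers, sending everything to the cloud costs $100 a month, and keeping 90% of requests local brings that down to roughly $10.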

Troubleshooting

  • Proxy won't start → Check the config YAML syntax. Run litellm --config config.yaml --debug for verbose logs.
  • Ollama models not found → Verify ollama list shows the model, and the name matches exactly.
  • Slow first request → The first request to a local model is often slow because Ollama loads the model into memory on first use. The proxy may also need a moment to connect to its providers after startup.
  • Rate limit errors → Adjust rpm (requests per minute) in the config to match your GPU's capacity.
  • CORS errors → Add --cors-origins http://localhost:3000 to the startup command.
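
For the "models not found" case, the name check can be automated. The sketch below compares the ollama/ entries from the config with the models Ollama reports at its /api/tags endpoint; missing_ollama_models and fetch_ollama_tags are hypothetical helper names, and the config is passed in as Python dicts rather than parsed from YAML.

```python
import json
import urllib.request


def missing_ollama_models(model_list: list[dict], tags_payload: dict) -> list[str]:
    """Return Ollama model tags named in the config that Ollama has not pulled."""
    installed = {model["name"] for model in tags_payload.get("models", [])}
    missing = []
    for entry in model_list:
        model = entry["litellm_params"]["model"]
        if not model.startswith("ollama/"):
            continue  # cloud models are not served by Ollama
        tag = model.removeprefix("ollama/")
        if tag not in installed:
            missing.append(tag)
    return missing


def fetch_ollama_tags(api_base: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server which models it has available locally."""
    with urllib.request.urlopen(f"{api_base.rstrip('/')}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

Any tag the first function returns can be fetched with ollama pull before restarting the proxy.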

Swap components

  • Skip cloud fallback → Remove the OpenAI/Anthropic entries from config for a 100% local gateway.
  • Use PostgreSQL → Set DATABASE_URL for persistent spend logs and user management.
  • Add more local backends → LiteLLM supports vLLM, LM Studio, and TGI as backends alongside Ollama.
  • Containerized → Run ghcr.io/berriai/litellm:main-latest as a Docker container.

Frequently asked

What is the Multi-Model API Gateway (LiteLLM + Ollama) stack for?

It is a unified API gateway that serves local models through Ollama and 100+ cloud providers behind a single OpenAI-compatible endpoint, with load balancing, cost tracking, and rate limiting. The gateway and the local models run on your own hardware; cloud providers are optional.

How much does the Multi-Model API Gateway (LiteLLM + Ollama) stack cost?

Multi-Model API Gateway (LiteLLM + Ollama) costs around $800 in hardware up front and $0/month to run, since the gateway and local inference are self-hosted with no per-token or subscription fees. Cloud fallback models, if you enable them, are billed pay-as-you-go by the provider.

How long does it take to set up Multi-Model API Gateway (LiteLLM + Ollama)?

Plan for roughly 20 minutes. The stack is rated advanced.

What do I need to run Multi-Model API Gateway (LiteLLM + Ollama)?

Multi-Model API Gateway (LiteLLM + Ollama) is built from 2 tools, 2 models, and 2 hardware items. Each is listed below with a link.