What it does
Core capabilities at a glance
- PagedAttention for near-zero memory waste
- Continuous batching for maximum throughput
- Tensor parallelism across multiple GPUs
- OpenAI-compatible API server
- Prefix caching and speculative decoding
- Quantization support (AWQ, GPTQ, FP8, GGUF)
Deep dive
The full breakdown - performance, comparisons, and setup
vLLM
vLLM is the serving engine you reach for when Ollama's single-user performance isn't enough. Built at UC Berkeley's Sky Computing Lab, it introduced PagedAttention - a memory management technique that dramatically increases throughput by eliminating the memory waste that plagues other inference engines.
What it is
vLLM is a Python-based LLM serving library that optimizes for throughput and memory efficiency. Its key innovation, PagedAttention, manages the KV cache in fixed-size blocks (like virtual memory paging in operating systems), reducing fragmentation and allowing much larger batch sizes.
Unlike llama.cpp which optimizes for single-user latency, vLLM optimizes for multi-user throughput. If you have a GPU and want to serve a model to many users simultaneously, vLLM is the right choice.
Why this matters
As of June 2026, vLLM has become the standard backend for production local AI deployments:
- Continuous batching: incoming requests join the next batch automatically, keeping GPU utilization near 100%
- PagedAttention: eliminates the memory waste that limits concurrent users in other engines
- Prefix caching: when multiple users share a system prompt, the shared prefix is computed once
- Speculative decoding: use a small draft model to accelerate the large model by 2-3x
Performance you'll see
| Hardware | Model | Concurrent Users | Throughput |
|---|---|---|---|
| RTX 4090 | Qwen3 8B (AWQ) | 8 | ~400 tok/s aggregate |
| RTX 5090 | Qwen3 30B (FP8) | 6 | ~180 tok/s aggregate |
| Dual RTX 3090 | Llama 3.3 70B (AWQ) | 10 | ~140 tok/s aggregate |
| 4x A6000 | Llama 3.3 70B (FP8) | 30 | ~600 tok/s aggregate |
All numbers measured with 2k input, 512 output tokens. Source: vLLM benchmark suite.
How it stacks up
| vLLM | llama.cpp | Ollama | TGI | |
|---|---|---|---|---|
| Throughput | Best | Good | Moderate | Good |
| Multi-user | Yes (production) | Limited | No | Yes |
| GPU support | CUDA | CUDA/Metal/Vulkan | CUDA/Metal/ROCm | CUDA |
| Setup complexity | High | Medium | Low | Medium |
| Best for | Production serving | Power users | Daily driver | HuggingFace ecosystem |
What runs on it
- Open WebUI - connect it to vLLM's OpenAI-compatible endpoint
- AnythingLLM - supports vLLM as a provider
- LangChain - native vLLM integration in the Python SDK
Get started
pip install vllm
# Serve a model
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-8B \
--dtype auto \
--max-model-len 8192
# Or use docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model Qwen/Qwen3-8BWhat the community says
"vLLM turned our single-GPU Llama serving from 20 req/s to 80 req/s with the same hardware."
- u/ml-infra-eng on r/LocalLLaMA, 445 upvotes
"If you're serving models to more than 5 concurrent users, skip Ollama and go straight to vLLM."
- u/prod-ml-engineer on r/selfhosted, 289 upvotes
When to use something else
- Single-user or desktop use: Ollama or LM Studio are simpler
- Need CPU inference: use llama.cpp instead - vLLM needs CUDA
- Windows user: vLLM has limited Windows support; use TensorRT-LLM or llama.cpp
Frequently asked
Quick answers to common questions
What is vLLM?
vLLM is a inference-server tool for local AI workloads. High-throughput LLM serving engine with PagedAttention - the gold standard for production local inference.
Is vLLM free and open source?
Yes, vLLM has 82,154 GitHub stars and is licensed under Apache-2.0. You can self-host it for free on linux, docker.
What platforms does vLLM support?
vLLM runs on linux, docker.
What hardware do I need for vLLM?
The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. vLLM has 82,154 GitHub stars and an active community.
Does vLLM support GPU acceleration?
vLLM supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to vLLM?
Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does vLLM cost?
vLLM is free-open-source. It is completely free and open source to self-host.
Pairs well with
Complementary tools, models, and hardware
Comments coming soon
Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.