What it does

Core capabilities at a glance

PagedAttention for near-zero memory waste
Continuous batching for maximum throughput
Tensor parallelism across multiple GPUs
OpenAI-compatible API server
Prefix caching and speculative decoding
Quantization support (AWQ, GPTQ, FP8, GGUF)

Deep dive

The full breakdown - performance, comparisons, and setup

vLLM

vLLM is the serving engine you reach for when Ollama's single-user performance isn't enough. Built at UC Berkeley's Sky Computing Lab, it introduced PagedAttention - a memory management technique that dramatically increases throughput by eliminating the memory waste that plagues other inference engines.

What it is

vLLM is a Python-based LLM serving library that optimizes for throughput and memory efficiency. Its key innovation, PagedAttention, manages the KV cache in fixed-size blocks (like virtual memory paging in operating systems), reducing fragmentation and allowing much larger batch sizes.

Unlike llama.cpp which optimizes for single-user latency, vLLM optimizes for multi-user throughput. If you have a GPU and want to serve a model to many users simultaneously, vLLM is the right choice.

Why this matters

As of June 2026, vLLM has become the standard backend for production local AI deployments:

Continuous batching: incoming requests join the next batch automatically, keeping GPU utilization near 100%
PagedAttention: eliminates the memory waste that limits concurrent users in other engines
Prefix caching: when multiple users share a system prompt, the shared prefix is computed once
Speculative decoding: use a small draft model to accelerate the large model by 2-3x

Performance you'll see

Hardware	Model	Concurrent Users	Throughput
RTX 4090	Qwen3 8B (AWQ)	8	~400 tok/s aggregate
RTX 5090	Qwen3 30B (FP8)	6	~180 tok/s aggregate
Dual RTX 3090	Llama 3.3 70B (AWQ)	10	~140 tok/s aggregate
4x A6000	Llama 3.3 70B (FP8)	30	~600 tok/s aggregate

All numbers measured with 2k input, 512 output tokens. Source: vLLM benchmark suite.

How it stacks up

	vLLM	llama.cpp	Ollama	TGI
Throughput	Best	Good	Moderate	Good
Multi-user	Yes (production)	Limited	No	Yes
GPU support	CUDA	CUDA/Metal/Vulkan	CUDA/Metal/ROCm	CUDA
Setup complexity	High	Medium	Low	Medium
Best for	Production serving	Power users	Daily driver	HuggingFace ecosystem

What runs on it

Open WebUI - connect it to vLLM's OpenAI-compatible endpoint
AnythingLLM - supports vLLM as a provider
LangChain - native vLLM integration in the Python SDK

Get started

pip install vllm
 
# Serve a model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --dtype auto \
  --max-model-len 8192
 
# Or use docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B

What the community says

"vLLM turned our single-GPU Llama serving from 20 req/s to 80 req/s with the same hardware."

u/ml-infra-eng on r/LocalLLaMA, 445 upvotes

"If you're serving models to more than 5 concurrent users, skip Ollama and go straight to vLLM."

u/prod-ml-engineer on r/selfhosted, 289 upvotes

When to use something else

Single-user or desktop use: Ollama or LM Studio are simpler
Need CPU inference: use llama.cpp instead - vLLM needs CUDA
Windows user: vLLM has limited Windows support; use TensorRT-LLM or llama.cpp

Frequently asked

Quick answers to common questions

What is vLLM?

vLLM is a inference-server tool for local AI workloads. High-throughput LLM serving engine with PagedAttention - the gold standard for production local inference.

Is vLLM free and open source?

Yes, vLLM has 86,906 GitHub stars and is licensed under Apache-2.0. You can self-host it for free on linux, docker.

What platforms does vLLM support?

vLLM runs on linux, docker.

What hardware do I need for vLLM?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. vLLM has 86,906 GitHub stars and an active community.

Does vLLM support GPU acceleration?

vLLM supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to vLLM?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does vLLM cost?

vLLM is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

vLLM

What it does

Deep dive

vLLM

What it is

Why this matters

Performance you'll see

How it stacks up

What runs on it

Get started

What the community says

When to use something else

Frequently asked

What is vLLM?

Is vLLM free and open source?

What platforms does vLLM support?

What hardware do I need for vLLM?

Does vLLM support GPU acceleration?

What are the best alternatives to vLLM?

How much does vLLM cost?

Pairs well with

Tools

Models

Hardware