vLLM
inference-serverFeatured82,154Apache-2.0

vLLM

High-throughput LLM serving engine with PagedAttention - the gold standard for production local inference.

Updated Jun 7, 2026
Platforms
linux, docker
Pricing
free-open-source
Status
active
License
Apache-2.0

What it does

Core capabilities at a glance

  • PagedAttention for near-zero memory waste
  • Continuous batching for maximum throughput
  • Tensor parallelism across multiple GPUs
  • OpenAI-compatible API server
  • Prefix caching and speculative decoding
  • Quantization support (AWQ, GPTQ, FP8, GGUF)

Deep dive

The full breakdown - performance, comparisons, and setup

vLLM

vLLM is the serving engine you reach for when Ollama's single-user performance isn't enough. Built at UC Berkeley's Sky Computing Lab, it introduced PagedAttention - a memory management technique that dramatically increases throughput by eliminating the memory waste that plagues other inference engines.

What it is

vLLM is a Python-based LLM serving library that optimizes for throughput and memory efficiency. Its key innovation, PagedAttention, manages the KV cache in fixed-size blocks (like virtual memory paging in operating systems), reducing fragmentation and allowing much larger batch sizes.

Unlike llama.cpp which optimizes for single-user latency, vLLM optimizes for multi-user throughput. If you have a GPU and want to serve a model to many users simultaneously, vLLM is the right choice.

Why this matters

As of June 2026, vLLM has become the standard backend for production local AI deployments:

  1. Continuous batching: incoming requests join the next batch automatically, keeping GPU utilization near 100%
  2. PagedAttention: eliminates the memory waste that limits concurrent users in other engines
  3. Prefix caching: when multiple users share a system prompt, the shared prefix is computed once
  4. Speculative decoding: use a small draft model to accelerate the large model by 2-3x

Performance you'll see

HardwareModelConcurrent UsersThroughput
RTX 4090Qwen3 8B (AWQ)8~400 tok/s aggregate
RTX 5090Qwen3 30B (FP8)6~180 tok/s aggregate
Dual RTX 3090Llama 3.3 70B (AWQ)10~140 tok/s aggregate
4x A6000Llama 3.3 70B (FP8)30~600 tok/s aggregate

All numbers measured with 2k input, 512 output tokens. Source: vLLM benchmark suite.

How it stacks up

vLLMllama.cppOllamaTGI
ThroughputBestGoodModerateGood
Multi-userYes (production)LimitedNoYes
GPU supportCUDACUDA/Metal/VulkanCUDA/Metal/ROCmCUDA
Setup complexityHighMediumLowMedium
Best forProduction servingPower usersDaily driverHuggingFace ecosystem

What runs on it

  • Open WebUI - connect it to vLLM's OpenAI-compatible endpoint
  • AnythingLLM - supports vLLM as a provider
  • LangChain - native vLLM integration in the Python SDK

Get started

pip install vllm
 
# Serve a model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --dtype auto \
  --max-model-len 8192
 
# Or use docker
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B

What the community says

"vLLM turned our single-GPU Llama serving from 20 req/s to 80 req/s with the same hardware."

"If you're serving models to more than 5 concurrent users, skip Ollama and go straight to vLLM."

When to use something else

  • Single-user or desktop use: Ollama or LM Studio are simpler
  • Need CPU inference: use llama.cpp instead - vLLM needs CUDA
  • Windows user: vLLM has limited Windows support; use TensorRT-LLM or llama.cpp

Frequently asked

Quick answers to common questions

What is vLLM?

vLLM is a inference-server tool for local AI workloads. High-throughput LLM serving engine with PagedAttention - the gold standard for production local inference.

Is vLLM free and open source?

Yes, vLLM has 82,154 GitHub stars and is licensed under Apache-2.0. You can self-host it for free on linux, docker.

What platforms does vLLM support?

vLLM runs on linux, docker.

What hardware do I need for vLLM?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. vLLM has 82,154 GitHub stars and an active community.

Does vLLM support GPU acceleration?

vLLM supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to vLLM?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does vLLM cost?

vLLM is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.