What it does
Core capabilities at a glance
- 4-bit through 8-bit quantization (Q4_K_M, Q5_K_M, Q8_0)
- GPU acceleration via CUDA, Metal, Vulkan, SYCL, and HIP
- HTTP server with OpenAI-compatible API
- Supports 100+ model architectures including GGUF format
- Flash Attention, K/V cache quantization, and speculative decoding
- Built-in tokenization, grammar constraints, and beam search
Deep dive
The full breakdown - performance, comparisons, and setup
llama.cpp
llama.cpp is the software that made local LLMs practical. Before it, running a 7B model required expensive GPUs, complex Python environments, and patience. llama.cpp changed that by bringing efficient quantization and CPU-friendly inference to the masses.
What it is
llama.cpp is a C/C++ inference engine originally created by Georgi Gerganov that implements the Llama architecture and 100+ derivative models. It introduced the GGUF format (preceded by GGML), which packages model weights, tokenizer, and metadata into a single file you can download and run immediately.
The project has grown far beyond its original scope. It now supports GPU acceleration via CUDA, Metal, Vulkan, SYCL, and AMD HIP; offers an HTTP server with an OpenAI-compatible API; and includes a built-in WebUI for browsing models and conversations.
Why this matters
Nearly every tool in the local AI ecosystem either wraps llama.cpp or started as a fork of it:
- Ollama uses llama.cpp as its inference backend
- LM Studio bundles it under the hood
- Jan uses it via the Nitro engine (a fork)
- text-generation-webui offers it as a loader option
- Countless projects build on its C API and HTTP server
Understanding llama.cpp gives you insight into how all these tools work under the hood.
Performance you'll see
llama.cpp on a single GPU delivers some of the best token-per-second rates available:
| Hardware | Model & Quantization | Token Rate | Notes |
|---|---|---|---|
| RTX 4090 | Qwen3 8B Q4_K_M | ~95 tok/s | Ideal for interactive use |
| RTX 5090 | Llama 3.3 70B Q4_K_M | ~16 tok/s | Fits with room for context |
| Mac M4 Pro | Mistral Small 3 Q4_K_M | ~45 tok/s | Metal acceleration performs well |
| CPU-only (AMD 7950X) | Qwen3 8B Q4_K_M | ~12 tok/s | No GPU needed |
| Dual RTX 3090 | Llama 3.3 70B Q5_K_M | ~20 tok/s | Split across GPUs |
Source: llama.cpp discussion benchmarks and r/LocalLLaMA community tests.
How it stacks up
| llama.cpp | vLLM | Ollama | LM Studio | |
|---|---|---|---|---|
| Raw performance | High | Very high | Medium | Medium |
| Beginner-friendly | No | No | Yes | Yes |
| GPU support | CUDA/Metal/Vulkan/ROCm | CUDA only | CUDA/Metal/ROCm | CUDA/Metal/ROCm |
| Quantization | Full GGUF control | Limited | Curated | Curated |
| API | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible |
| Best for | Power users, servers | Production serving | Daily driver | Desktop chat |
What runs on it
Since llama.cpp is the foundation, any tool that works with Ollama ultimately works with llama.cpp:
- Ollama - the easiest wrapper around llama.cpp
- Open WebUI - ChatGPT-style frontend that connects to llama.cpp server
- Continue - VS Code coding assistant that can use llama.cpp as backend
- AnythingLLM - RAG platform supporting llama.cpp API
What models you can run
Any model in GGUF format from Hugging Face, plus every model in the Ollama library. Top picks:
- Qwen3 30B - best mid-range all-rounder
- Llama 3.3 70B - best quality if you have VRAM
- Mistral Small 3 - fast, capable, fits anywhere
Get started
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a model (GGUF format)
wget https://huggingface.co/bartowski/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf
# Run the server
./build/bin/llama-server -m Qwen3-8B-Q4_K_M.gguf --port 8080
# Or just chat in terminal
./build/bin/llama-cli -m Qwen3-8B-Q4_K_M.gguf -p "Hello, how are you?"What the community says
"llama.cpp is the single most important piece of infrastructure in local AI. Every tool in the ecosystem either wraps it or competes with it."
- u/local-llm-dev on r/LocalLLaMA, 534 upvotes
"The server mode with OpenAI API compatibility turned llama.cpp from a CLI toy into a production backend."
- u/selfhosted-engineer on r/selfhosted, 312 upvotes
When to use something else
- You want a beginner-friendly setup: use Ollama instead - it wraps llama.cpp with a clean CLI
- You need production multi-user serving: use vLLM with PagedAttention and continuous batching
- You want a desktop GUI: LM Studio or Jan are better daily drivers
But if you want full control over quantization, GPU settings, and inference parameters, nothing beats running llama.cpp directly.
Frequently asked
Quick answers to common questions
What is llama.cpp?
llama.cpp is a inference-server tool for local AI workloads. High-performance LLM inference in pure C/C++ with GPU acceleration - the engine behind most local AI tools.
Is llama.cpp free and open source?
Yes, llama.cpp has 115,239 GitHub stars and is licensed under MIT. You can self-host it for free on macos, linux, windows, docker.
What platforms does llama.cpp support?
llama.cpp runs on macos, linux, windows, docker.
What hardware do I need for llama.cpp?
The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. llama.cpp has 115,239 GitHub stars and an active community.
Does llama.cpp support GPU acceleration?
llama.cpp supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to llama.cpp?
Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does llama.cpp cost?
llama.cpp is free-open-source. It is completely free and open source to self-host.
Pairs well with
Complementary tools, models, and hardware
Comments coming soon
Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.