llama.cpp
inference-serverFeatured115,239MIT

llama.cpp

High-performance LLM inference in pure C/C++ with GPU acceleration - the engine behind most local AI tools.

Updated Jun 7, 2026
Platforms
macos, linux, windows, docker
Pricing
free-open-source
Status
active
License
MIT

What it does

Core capabilities at a glance

  • 4-bit through 8-bit quantization (Q4_K_M, Q5_K_M, Q8_0)
  • GPU acceleration via CUDA, Metal, Vulkan, SYCL, and HIP
  • HTTP server with OpenAI-compatible API
  • Supports 100+ model architectures including GGUF format
  • Flash Attention, K/V cache quantization, and speculative decoding
  • Built-in tokenization, grammar constraints, and beam search

Deep dive

The full breakdown - performance, comparisons, and setup

llama.cpp

llama.cpp is the software that made local LLMs practical. Before it, running a 7B model required expensive GPUs, complex Python environments, and patience. llama.cpp changed that by bringing efficient quantization and CPU-friendly inference to the masses.

What it is

llama.cpp is a C/C++ inference engine originally created by Georgi Gerganov that implements the Llama architecture and 100+ derivative models. It introduced the GGUF format (preceded by GGML), which packages model weights, tokenizer, and metadata into a single file you can download and run immediately.

The project has grown far beyond its original scope. It now supports GPU acceleration via CUDA, Metal, Vulkan, SYCL, and AMD HIP; offers an HTTP server with an OpenAI-compatible API; and includes a built-in WebUI for browsing models and conversations.

Why this matters

Nearly every tool in the local AI ecosystem either wraps llama.cpp or started as a fork of it:

  • Ollama uses llama.cpp as its inference backend
  • LM Studio bundles it under the hood
  • Jan uses it via the Nitro engine (a fork)
  • text-generation-webui offers it as a loader option
  • Countless projects build on its C API and HTTP server

Understanding llama.cpp gives you insight into how all these tools work under the hood.

Performance you'll see

llama.cpp on a single GPU delivers some of the best token-per-second rates available:

HardwareModel & QuantizationToken RateNotes
RTX 4090Qwen3 8B Q4_K_M~95 tok/sIdeal for interactive use
RTX 5090Llama 3.3 70B Q4_K_M~16 tok/sFits with room for context
Mac M4 ProMistral Small 3 Q4_K_M~45 tok/sMetal acceleration performs well
CPU-only (AMD 7950X)Qwen3 8B Q4_K_M~12 tok/sNo GPU needed
Dual RTX 3090Llama 3.3 70B Q5_K_M~20 tok/sSplit across GPUs

Source: llama.cpp discussion benchmarks and r/LocalLLaMA community tests.

How it stacks up

llama.cppvLLMOllamaLM Studio
Raw performanceHighVery highMediumMedium
Beginner-friendlyNoNoYesYes
GPU supportCUDA/Metal/Vulkan/ROCmCUDA onlyCUDA/Metal/ROCmCUDA/Metal/ROCm
QuantizationFull GGUF controlLimitedCuratedCurated
APIOpenAI-compatibleOpenAI-compatibleOpenAI-compatibleOpenAI-compatible
Best forPower users, serversProduction servingDaily driverDesktop chat

What runs on it

Since llama.cpp is the foundation, any tool that works with Ollama ultimately works with llama.cpp:

  • Ollama - the easiest wrapper around llama.cpp
  • Open WebUI - ChatGPT-style frontend that connects to llama.cpp server
  • Continue - VS Code coding assistant that can use llama.cpp as backend
  • AnythingLLM - RAG platform supporting llama.cpp API

What models you can run

Any model in GGUF format from Hugging Face, plus every model in the Ollama library. Top picks:

  • Qwen3 30B - best mid-range all-rounder
  • Llama 3.3 70B - best quality if you have VRAM
  • Mistral Small 3 - fast, capable, fits anywhere

Get started

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
 
# Download a model (GGUF format)
wget https://huggingface.co/bartowski/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf
 
# Run the server
./build/bin/llama-server -m Qwen3-8B-Q4_K_M.gguf --port 8080
 
# Or just chat in terminal
./build/bin/llama-cli -m Qwen3-8B-Q4_K_M.gguf -p "Hello, how are you?"

What the community says

"llama.cpp is the single most important piece of infrastructure in local AI. Every tool in the ecosystem either wraps it or competes with it."

"The server mode with OpenAI API compatibility turned llama.cpp from a CLI toy into a production backend."

When to use something else

  • You want a beginner-friendly setup: use Ollama instead - it wraps llama.cpp with a clean CLI
  • You need production multi-user serving: use vLLM with PagedAttention and continuous batching
  • You want a desktop GUI: LM Studio or Jan are better daily drivers

But if you want full control over quantization, GPU settings, and inference parameters, nothing beats running llama.cpp directly.

Frequently asked

Quick answers to common questions

What is llama.cpp?

llama.cpp is a inference-server tool for local AI workloads. High-performance LLM inference in pure C/C++ with GPU acceleration - the engine behind most local AI tools.

Is llama.cpp free and open source?

Yes, llama.cpp has 115,239 GitHub stars and is licensed under MIT. You can self-host it for free on macos, linux, windows, docker.

What platforms does llama.cpp support?

llama.cpp runs on macos, linux, windows, docker.

What hardware do I need for llama.cpp?

The hardware requirements depend on which models you run. Check our hardware directory for compatible GPUs and systems. llama.cpp has 115,239 GitHub stars and an active community.

Does llama.cpp support GPU acceleration?

llama.cpp supports GPU acceleration via CUDA, Metal, or Vulkan depending on your platform. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.

What are the best alternatives to llama.cpp?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does llama.cpp cost?

llama.cpp is free-open-source. It is completely free and open source to self-host.

Pairs well with

Complementary tools, models, and hardware

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.