ExLlamaV2
inference-server | 4,544 GitHub stars | MIT

Fast inference library for quantized LLMs, optimized for single-GPU use and high token throughput.

Updated Jun 7, 2026
Platforms: Linux, Windows
Pricing: free, open source
Status: active
License: MIT

What it does

Core capabilities at a glance

  • Highly optimized GPTQ inference kernel
  • FlashAttention for long context support
  • 4-bit and 8-bit matrix multiplication
  • Dynamic quantization with FP16 fallback
  • Low VRAM usage with batch inference
  • Python API and example web server
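
To make the 4-bit storage and FP16 fallback ideas concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It uses plain round-to-nearest, which is an assumption for illustration: it is not GPTQ's error-compensating procedure and not ExLlamaV2's actual kernel. The function names `quantize_group_4bit` and `dequantize_group` are hypothetical.

```python
from typing import List, Tuple


def quantize_group_4bit(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of weights to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so 15 steps span the group's value range.
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights; a GPU kernel would do this on the fly before the matmul."""
    return [lo + c * scale for c in codes]


# Illustrative weights only, not taken from any real model.
group = [-0.30, -0.12, 0.00, 0.07, 0.21, 0.45]
codes, scale, lo = quantize_group_4bit(group)
restored = dequantize_group(codes, scale, lo)

# Round-to-nearest keeps each weight within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, max_error)
```

Each weight is stored as a 4-bit code, and only the per-group scale and offset stay in higher precision, which is where the VRAM saving comes from.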

Deep dive

The full breakdown - performance, comparisons, and setup

ExLlamaV2 is among the fastest inference libraries for GPTQ-quantized models on a single GPU. If you're running 4-bit quantized models and want maximum token throughput, this is the engine built for that job.
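
Throughput claims are easiest to judge on your own hardware. The sketch below times any generation callable and reports tokens per second. The helper name `tokens_per_second` and the stand-in `fake_generate` are hypothetical; in practice you would pass a function that calls your real generator.

```python
import time
from typing import Callable, List


def tokens_per_second(generate_fn: Callable[[int], List[int]], num_tokens: int) -> float:
    """Time one call to generate_fn and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate_fn(num_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate(n: int) -> List[int]:
    # Stand-in for a real model call; sleeps briefly to simulate work.
    time.sleep(0.01)
    return list(range(n))


rate = tokens_per_second(fake_generate, 50)
print(f"{rate:.0f} tokens/s")
```

For a fair comparison between engines, use the same prompt length, the same number of new tokens, and a warm-up run before timing.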

What it is

ExLlamaV2 is a Python/C++ inference library focused on GPTQ-quantized models; it also loads its own EXL2 quantization format. It's the backend that powers TabbyAPI and the ExLlamaV2 loader in text-generation-webui.

Get started

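from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/gptq/model"  # directory holding the quantized weights
config.prepare()  # read the model's config files before building the model

model = ExLlamaV2(config)
model.load()  # load the weights onto the GPU
tokenizer = ExLlamaV2Tokenizer(config)
# Generate tokens...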
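
The snippet above stops once the model is loaded. As a hedged sketch of the generation step, the function below follows the dynamic-generator pattern from the library's published examples in recent releases. Class and argument names may differ between versions, so treat them as assumptions and check them against your installed release. The wrapper name `generate_text` is hypothetical.

```python
def generate_text(model_dir: str, prompt: str, max_new_tokens: int = 200) -> str:
    """Load a quantized model from model_dir and return a completion for prompt."""
    # Imported lazily so this file can be imported on machines without exllamav2.
    from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2DynamicGenerator

    config = ExLlamaV2Config(model_dir)       # reads the model's config files
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # KV cache, allocated while loading
    model.load_autosplit(cache)               # load weights, splitting across GPUs if needed
    tokenizer = ExLlamaV2Tokenizer(config)

    generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
    return generator.generate(prompt=prompt, max_new_tokens=max_new_tokens)
```

A call such as `generate_text("/path/to/gptq/model", "Once upon a time,")` requires a supported GPU and an installed copy of the library.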

When to use something else

Frequently asked

Quick answers to common questions

What is ExLlamaV2?

ExLlamaV2 is an inference-server tool for local AI workloads. It is a fast inference library for quantized LLMs, optimized for single-GPU use and high token throughput.

Is ExLlamaV2 free and open source?

Yes. ExLlamaV2 is licensed under MIT and has 4,544 GitHub stars. You can self-host it for free on Linux and Windows.

What platforms does ExLlamaV2 support?

ExLlamaV2 runs on Linux and Windows.

What hardware do I need for ExLlamaV2?

The hardware requirements depend on which models you run: larger models and longer contexts need more VRAM. Check our hardware directory for compatible GPUs and systems.
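
As a rough guide, the memory needed for the weights alone follows from parameter count and bits per weight. The sketch below works this out for a hypothetical 7-billion-parameter model. It ignores the KV cache, activations, and quantization metadata such as per-group scales, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Hypothetical 7B-parameter model, chosen only for illustration.
fp16_gib = weight_memory_gib(7e9, 16)  # about 13.0 GiB
int4_gib = weight_memory_gib(7e9, 4)   # about 3.3 GiB
print(f"FP16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Leave headroom above this estimate for the context window, since the KV cache grows with sequence length and batch size.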

Does ExLlamaV2 support GPU acceleration?

Yes. ExLlamaV2 is built for GPU inference and runs on NVIDIA GPUs through CUDA, with ROCm builds for AMD GPUs. It does not use Metal or Vulkan. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
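
ExLlamaV2 runs on top of PyTorch, so one way to check which GPU backend is available is to ask PyTorch directly. The sketch below is a minimal check under that assumption. The function name `detect_gpu_backend` is hypothetical, and it returns "none" when PyTorch is missing or no GPU is visible.

```python
def detect_gpu_backend() -> str:
    """Return "cuda", "rocm", or "none" based on the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        return "none"  # PyTorch is not installed
    if not torch.cuda.is_available():
        return "none"  # no usable GPU visible to PyTorch
    # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
    # and set torch.version.hip to a version string.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


backend = detect_gpu_backend()
print(backend)
```

If this prints "none", fix the driver or PyTorch installation before trying to load a model.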

What are the best alternatives to ExLlamaV2?

Popular alternatives include other inference-server tools in our directory. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.

How much does ExLlamaV2 cost?

Nothing. ExLlamaV2 is free, open-source software under the MIT license, and self-hosting it costs nothing beyond your own hardware.

Pairs well with

Complementary tools, models, and hardware
