What it does
Core capabilities at a glance
- Highly optimized GPTQ inference kernel
- FlashAttention for long context support
- 4-bit and 8-bit matrix multiplication
- Dynamic quantization with FP16 fallback
- Low VRAM usage with batch inference
- Python API and example web server
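To make the 4-bit idea concrete, here is a minimal sketch of group-wise quantization in plain Python. It uses simple round-to-nearest, which is not the GPTQ algorithm itself and not ExLlamaV2's kernel code; the function names `quantize_group` and `dequantize_group` are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group of floats to `bits`-bit codes.

    Illustrative only: GPTQ picks codes with an error-compensating procedure,
    but the stored form (integer codes plus a scale and zero point) is similar.
    """
    levels = (1 << bits) - 1               # 15 distinct steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


group = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, zero = quantize_group(group)
restored = dequantize_group(codes, scale, zero)

# Round-to-nearest keeps every weight within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as a small integer, and only the scale and zero point per group stay in higher precision, which is where the VRAM savings come from.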
Deep dive
The full breakdown - performance, comparisons, and setup
ExLlamaV2
ExLlamaV2 is built for fast inference of GPTQ-quantized models on a single GPU. If you're running 4-bit quantized models and want maximum token throughput, this is the engine to consider.
What it is
ExLlamaV2 is a Python/C++ inference library focused on GPTQ-quantized models. It's the backend that powers TabbyAPI and the ExLlamaV2 loader in text-generation-webui.
Get started
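from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer
# Point the config at the directory that holds the quantized model files
config = ExLlamaV2Config()
config.model_dir = "/path/to/gptq/model"
config.prepare()  # read the model's configuration from model_dir
# Build the model and load its weights onto the GPU
model = ExLlamaV2(config)
model.load()
# Generate tokens...
When to use something else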
- Need GGUF support: use llama.cpp
- Multi-GPU production: use vLLM or TensorRT-LLM
- Need an easy API: use TabbyAPI, which wraps ExLlamaV2
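As a rough sketch of the TabbyAPI route, the snippet below builds an HTTP completion request using only the standard library. It assumes the server exposes an OpenAI-style `/v1/completions` endpoint; the base URL, port, bearer-token header, and the helper name `build_completion_request` are all assumptions for illustration, so check your own TabbyAPI configuration before relying on them.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:5000",
                             api_key="YOUR_API_KEY", max_tokens=64):
    """Build (but do not send) a completion request.

    The endpoint path, port, and auth header are assumptions, not values
    taken from this page.
    """
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_completion_request("Hello")
# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Building the request separately from sending it keeps the example runnable without a server and makes the payload easy to inspect.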
Frequently asked
Quick answers to common questions
What is ExLlamaV2?
ExLlamaV2 is an inference library for running quantized LLMs locally. It is optimized for single-GPU setups and high token throughput.
Is ExLlamaV2 free and open source?
Yes. ExLlamaV2 is licensed under MIT and has 4,544 GitHub stars. You can self-host it for free on Linux and Windows.
What platforms does ExLlamaV2 support?
ExLlamaV2 runs on Linux and Windows.
What hardware do I need for ExLlamaV2?
The hardware requirements depend on which models you run: the quantized weights and the context cache need to fit in GPU memory. Check our hardware directory for compatible GPUs and systems.
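For a back-of-the-envelope check, the sketch below estimates the memory taken by the weights alone at different bit widths. The helper name `estimate_weight_gib` and the 7-billion-parameter example are illustrative assumptions; the figure leaves out the context cache, activations, and quantization metadata, so real usage will be higher.

```python
def estimate_weight_gib(n_params, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB.

    Ignores the context cache, activations, and per-group scale metadata.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare FP16, 8-bit, and 4-bit storage for a hypothetical 7B-parameter model.
for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {estimate_weight_gib(7e9, bits):.1f} GiB")
```

Leave headroom above this estimate when choosing a GPU, since longer contexts and larger batches increase cache usage.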
Does ExLlamaV2 support GPU acceleration?
Yes. ExLlamaV2 is GPU-only: it runs on NVIDIA GPUs via CUDA, with ROCm builds available for AMD GPUs. It does not target Metal or Vulkan. For the best performance, pair it with an NVIDIA RTX 4090 or 5090.
What are the best alternatives to ExLlamaV2?
Popular alternatives include llama.cpp for GGUF models and vLLM or TensorRT-LLM for multi-GPU production serving. Browse our full collection at /tool for comparisons, community reviews, and benchmark data to find the right fit for your workflow.
How much does ExLlamaV2 cost?
ExLlamaV2 is free and open source under the MIT license, so self-hosting it costs nothing beyond your own hardware.