VRAM Calculator
Pick a model, a quantization, and your context length. We'll tell you the exact VRAM needed and the cheapest GPU on the market that fits it.
How this works
VRAM requirement uses the same formula as our shared calculator utilities:
calcVram(paramsB, bytesPerParam, contextLength)// params × bytes_per_param + context × 0.0002 GB + 1.5 GB overhead
Quantization formats
Source: llama.cpp docs + r/LocalLLaMA community benchmarks. KV-cache is included in the overhead estimate; actual usage may vary ±10%. We round up to the nearest 0.5 GB.
Frequently asked
How accurate is this VRAM estimate?
Within roughly ±10% for GGUF quantizations in llama.cpp / Ollama. We base the math on llama.cpp's documented per-quant byte footprints plus a 1.5 GB overhead for KV-cache and runtime. Real usage varies with batch size, draft model, and inference engine.
Why is VRAM, not RAM, the bottleneck?
LLM inference is memory-bandwidth-bound. Weights must live in fast GPU memory (VRAM) to feed the matrix multiplies at GPU speed. If weights spill to system RAM, throughput collapses to single-digit tokens/sec because PCIe is ~50× slower than VRAM.
Does Apple Silicon work differently?
Yes - unified memory means all RAM is "VRAM". A Mac Studio M4 Ultra with 192 GB unified can hold models that would need a multi-GPU server on NVIDIA. Per-token speed is slower than a 4090/5090 but the VRAM ceiling is far higher per dollar at the top end.
What quantization should I pick?
Q4_K_M is the consensus sweet spot - ~98% of FP16 quality at ~27% of the size. Step up to Q5_K_M or Q8_0 only if you have spare VRAM and need the last few percent on hard reasoning. Q3 and below noticeably degrade.