Question 1

How accurate is this VRAM estimate?

Accepted Answer

Within roughly ±10% for GGUF quantizations in llama.cpp / Ollama. The math uses llama.cpp's documented per-quant byte footprints plus a flat 1.5 GB allowance for KV cache and runtime. Real usage varies with context length (the KV cache grows with it), batch size, draft model, and inference engine.
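As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The bits-per-weight values are approximate figures assumed for illustration (not taken from this page), and the function names are hypothetical; only the 1.5 GB overhead and the ±10% band come from the answer itself.

```python
# Approximate bits per weight for a few GGUF quant types.
# These are illustrative assumptions, not exact llama.cpp figures.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

OVERHEAD_GB = 1.5  # flat KV-cache + runtime allowance from the answer above


def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Estimated VRAM in GB: weight bytes plus the fixed overhead."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    weight_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weight_gb + OVERHEAD_GB


def estimate_range_gb(params_billions: float, quant: str, tolerance: float = 0.10):
    """Return the (low, high) band implied by the stated +/-10% accuracy."""
    est = estimate_vram_gb(params_billions, quant)
    return est * (1 - tolerance), est * (1 + tolerance)


if __name__ == "__main__":
    low, high = estimate_range_gb(8, "Q4_K_M")
    print(f"8B @ Q4_K_M: {low:.1f} to {high:.1f} GB")
```

Because the overhead is a flat constant here, long contexts would push real usage above this figure.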

Question 2

Why is VRAM, not RAM, the bottleneck?

Accepted Answer

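Token generation in LLM inference is memory-bandwidth-bound: producing each token requires reading the model's weights. Weights must live in fast GPU memory (VRAM) to feed the matrix multiplies at GPU speed. If weights spill to system RAM, throughput collapses to single-digit tokens/sec because the PCIe link is tens of times slower than VRAM (about 32 GB/s for PCIe 4.0 x16 versus roughly 1 TB/s on an RTX 4090).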
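The bandwidth argument can be made concrete with a back-of-the-envelope model. The sketch below assumes every generated token reads all weights once and that nothing else limits speed, so it gives an upper bound rather than a prediction. The function name and the bandwidth figures (1000 GB/s for VRAM, 32 GB/s for the slow path) are assumptions chosen for illustration.

```python
def tokens_per_sec_upper_bound(
    model_gb: float,
    spilled_gb: float = 0.0,
    fast_bw_gb_s: float = 1000.0,  # assumed VRAM bandwidth
    slow_bw_gb_s: float = 32.0,    # assumed bandwidth for spilled weights
) -> float:
    """Upper bound on decode speed when each token reads all weights once.

    The portion resident in VRAM is read at fast_bw_gb_s; the spilled
    portion is read at slow_bw_gb_s. Time per token is the sum of both.
    """
    if not 0.0 <= spilled_gb <= model_gb:
        raise ValueError("spilled_gb must be between 0 and model_gb")
    resident_gb = model_gb - spilled_gb
    seconds_per_token = resident_gb / fast_bw_gb_s + spilled_gb / slow_bw_gb_s
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # A 5 GB model fully in VRAM versus fully spilled.
    print(tokens_per_sec_upper_bound(5.0))
    print(tokens_per_sec_upper_bound(5.0, spilled_gb=5.0))
```

Under these assumptions, even a partial spill dominates the time per token, because the slow portion is read at a small fraction of VRAM speed.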

Question 3

Does Apple Silicon work differently?

Accepted Answer

Yes. Unified memory means the GPU shares system RAM, so most of it can serve as "VRAM". A Mac Studio M2 Ultra with 192 GB of unified memory can hold models that would need a multi-GPU server on NVIDIA. Per-token speed is slower than a 4090/5090, but the memory ceiling is far higher per dollar at the top end.

Question 4

What quantization should I pick?

Accepted Answer

Q4_K_M is the consensus sweet spot: ~98% of FP16 quality at roughly 30% of the size. Step up to Q5_K_M or Q8_0 only if you have spare VRAM and need the last few percent on hard reasoning. Q3 and below degrade noticeably.
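One way to turn this advice into a rule of thumb is to pick the highest-quality quantization whose estimated footprint still fits in available VRAM, never going below Q4_K_M. The sketch below is a hypothetical helper, not the calculator's actual logic; the bits-per-weight values are approximate assumptions, and the 1.5 GB overhead is the flat allowance mentioned earlier on this page.

```python
from typing import Optional

# Candidate quants, highest quality first. Bits-per-weight values are
# approximate, assumed for illustration.
CANDIDATES = [
    ("Q8_0", 8.5),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
]


def pick_quant(
    params_billions: float,
    vram_gb: float,
    overhead_gb: float = 1.5,
) -> Optional[str]:
    """Return the largest candidate quant that fits, or None if none fit."""
    for name, bits_per_weight in CANDIDATES:
        needed_gb = params_billions * bits_per_weight / 8 + overhead_gb
        if needed_gb <= vram_gb:
            return name
    # Nothing at Q4_K_M or above fits: consider a smaller model instead
    # of dropping to Q3, which degrades noticeably.
    return None


if __name__ == "__main__":
    for vram in (7, 8, 24):
        print(vram, "GB ->", pick_quant(8, vram))
```

Returning None instead of falling back to Q3 reflects the answer's point that a smaller model is usually a better trade than a heavily degraded quant.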
