LLM Comparison

Compare open-source LLMs side by side. Sort by benchmark score, VRAM requirements, context length, or parameter count. Click any model for full details.


Model	Params	Context (tokens)	MMLU	HumanEval	GSM8K	VRAM (Q4_K_M)	MT-Bench
DeepSeek V4 Pro	1600B	1049k	90.1	76.8	92.6	-	-
DeepSeek V4 Flash	284B	1049k	88.7	69.5	90.8	-	-
Kimi K2.6	1000B	262k	87.1	-	96.4	-	-
Qwen3.6 27B	28B	262k	86.1	82.6	93	~16 GB	8.7
GLM-5.1	744B	2097k	86	-	95.3	-	-
Llama 3.3 70B	70B	131k	86	78.4	92.1	~40 GB	8.9
Qwen2.5 72B	72B	131k	85	82	92	~40 GB	8.6
Phi-4	14B	16k	84.8	82.6	91.8	~8.5 GB	8.6
Qwen3 32B	32B	33k	83.4	82.9	94	~19 GB	8.7
Qwen3 14B	14B	33k	77	80.5	88	~8 GB	8.5
Llama 3.1 8B	8B	131k	69.4	72.1	84.5	~5.5 GB	8.1
Phi-4 Mini	3B	16k	64	65	78	~2.5 GB	7.5
EXAONE 4.5 33B	34.4B	262k	-	-	-	~20 GB	-
gemma-3-270m	0.3B	8k	-	-	-	-	-
Gemma 4 12B	12B	262k	-	-	-	~7 GB	-
Gemma 4 26B A4B	26.5B	262k	-	-	-	~15 GB	-
Gemma 4 31B	32.7B	262k	-	-	-	~19 GB	-
GLM-5.2	753.4B	1049k	-	-	-	~437 GB	-
gpt-oss-120b	120.4B	131k	-	-	-	~70 GB	-
gpt-oss-20b	21.5B	131k	-	-	-	~12 GB	-

Showing top 20 models by MMLU score. Browse all 95 models or use our VRAM calculator for precise hardware sizing.

How to use this comparison

Find your model

Sort by benchmark (MMLU for reasoning, HumanEval for code) to find the strongest model in your target size class.
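The sort-and-filter step can also be done offline. The following is a minimal sketch, assuming a few rows copied by hand from the table; the `MODELS` list and the `strongest_in_class` helper are hypothetical names for illustration, not part of this site. Rows whose score is a dash are stored as `None` and skipped.

```python
from typing import Optional

# (name, parameters in billions, MMLU score or None when the table shows a dash)
MODELS: list[tuple[str, float, Optional[float]]] = [
    ("Qwen3.6 27B", 28, 86.1),
    ("Llama 3.3 70B", 70, 86.0),
    ("Phi-4", 14, 84.8),
    ("Qwen3 32B", 32, 83.4),
    ("Qwen3 14B", 14, 77.0),
    ("Llama 3.1 8B", 8, 69.4),
    ("Gemma 4 12B", 12, None),
]


def strongest_in_class(models, max_params_b):
    """Return scored models at or below max_params_b, best MMLU first."""
    scored = [m for m in models if m[2] is not None and m[1] <= max_params_b]
    return sorted(scored, key=lambda m: m[2], reverse=True)


ranked = strongest_in_class(MODELS, max_params_b=14)
for name, params_b, mmlu in ranked:
    print(f"{name}: {params_b}B params, MMLU {mmlu}")
```

Swapping the sort key for a HumanEval column would rank the same size class for code work instead.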

Check VRAM fit

Compare the Q4_K_M VRAM column with your GPU's memory: an RTX 4090 has 24 GB, an RTX 5090 has 32 GB, and a Mac Studio offers up to 192 GB of unified memory.
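As a rough illustration of this check, the sketch below parses cells from the VRAM column and compares them with the GPU capacities quoted above. The dictionaries and the `parse_gb` and `fits` helpers are hypothetical, and the comparison uses the raw column value only.

```python
import re

# GPU capacities quoted in the text above.
GPU_MEMORY_GB = {"RTX 4090": 24, "RTX 5090": 32}

# VRAM column cells copied from the table; "-" means no estimate is listed.
VRAM_Q4 = {
    "Llama 3.3 70B": "~40 GB",
    "Qwen3 32B": "~19 GB",
    "Phi-4": "~8.5 GB",
    "DeepSeek V4 Pro": "-",
}


def parse_gb(cell):
    """Turn a cell such as '~8.5 GB' into 8.5; return None for a dash."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*GB", cell)
    return float(match.group(1)) if match else None


def fits(model, gpu):
    """True or False when an estimate exists, None when the table has a dash."""
    need = parse_gb(VRAM_Q4[model])
    if need is None:
        return None
    return need <= GPU_MEMORY_GB[gpu]


for model in VRAM_Q4:
    print(model, {gpu: fits(model, gpu) for gpu in GPU_MEMORY_GB})
```

Returning `None` for a dash keeps "unknown" distinct from "does not fit", which matters for the large models that have no estimate yet.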

Click through

Each model links to its full page with VRAM tables, quantization guides, community quotes, and recommended hardware.

Frequently asked

Which benchmark should I care about most?

MMLU (Massive Multitask Language Understanding) measures general knowledge and reasoning across 57 subjects and is the most cited single metric. HumanEval measures code generation. GSM8K tests math reasoning. MT-Bench measures conversational quality via LLM-as-judge. Prioritize the benchmark that matches your workload.

How accurate is the VRAM estimate?

The Q4_K_M VRAM figure is calculated from actual GGUF file sizes in llama.cpp. Real usage varies with batch size, context length, and inference engine, so add roughly 1.5 GB for KV cache and runtime overhead as a baseline; long contexts need more. Use our VRAM calculator for precise estimates.
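The overhead rule amounts to simple arithmetic. Here is a minimal sketch, assuming the flat 1.5 GB figure from the answer above; the function names are hypothetical, and real KV-cache usage depends on context length and engine.

```python
OVERHEAD_GB = 1.5  # baseline figure quoted above, not a measured value


def estimated_total_gb(q4_vram_gb, overhead_gb=OVERHEAD_GB):
    """Table value plus the flat overhead allowance."""
    return q4_vram_gb + overhead_gb


def headroom_gb(q4_vram_gb, gpu_memory_gb):
    """Memory left after the estimate; a negative value means it does not fit."""
    return gpu_memory_gb - estimated_total_gb(q4_vram_gb)


# Qwen3 32B is listed at ~19 GB; Llama 3.3 70B at ~40 GB.
print(estimated_total_gb(19))   # total estimate for Qwen3 32B
print(headroom_gb(19, 24))      # headroom on a 24 GB card
print(headroom_gb(40, 32))      # negative: too large for a 32 GB card
```

A small positive headroom is a warning sign rather than a guarantee, since the flat allowance does not grow with context length.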

Why do some models not have benchmarks?

Recently added models may not have verified benchmark scores yet. We update benchmarks as they become available from the community. Missing scores are shown as a dash and will be filled in on the next content refresh.

What quantization should I use?

Q4_K_M is the consensus sweet spot for the trade-off between quality and size. It preserves ~98% of FP16 quality while using only ~27% of the memory. Q5_K_M and Q8_0 offer marginal quality gains at significantly higher VRAM cost.
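To see what the ~27% figure implies, the sketch below works through the arithmetic, assuming FP16 stores 2 bytes per parameter and 1 GB = 10^9 bytes. The helper names are hypothetical, and the result is a rough sanity check only: the table values, which come from actual file sizes, run somewhat higher for some models.

```python
BYTES_PER_PARAM_FP16 = 2   # FP16 stores each weight in 16 bits
Q4_K_M_FRACTION = 0.27     # approximate ratio quoted above


def fp16_gb(params_billions):
    """FP16 weight size in GB: 1e9 parameters at 2 bytes each is 2 GB."""
    return params_billions * BYTES_PER_PARAM_FP16


def q4_k_m_gb(params_billions):
    """Rough Q4_K_M size using the ~27% rule of thumb."""
    return fp16_gb(params_billions) * Q4_K_M_FRACTION


for name, params_b in [("Llama 3.3 70B", 70), ("Qwen3 14B", 14), ("Llama 3.1 8B", 8)]:
    print(f"{name}: FP16 ~{fp16_gb(params_b):.0f} GB, Q4_K_M ~{q4_k_m_gb(params_b):.1f} GB")
```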