LLM Comparison
Compare open-source LLMs side-by-side. Sort by benchmark score, VRAM requirements, context length, or parameter count. Click any model for full details.
| Model | Params | Context (tokens) | MMLU | HumanEval | GSM8K | VRAM (Q4_K_M) | MT-Bench |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Pro | 1600B | 1049k | 90.1 | 76.8 | 92.6 | - | - |
| DeepSeek V4 Flash | 284B | 1049k | 88.7 | 69.5 | 90.8 | - | - |
| Kimi K2.6 | 1000B | 262k | 87.1 | - | 96.4 | - | - |
| Qwen3.6 27B | 28B | 262k | 86.1 | 82.6 | 93 | ~16 GB | 8.7 |
| GLM-5.1 | 744B | 2097k | 86 | - | 95.3 | - | - |
| Llama 3.3 70B | 70B | 131k | 86 | 78.4 | 92.1 | ~40 GB | 8.9 |
| Qwen2.5 72B | 72B | 131k | 85 | 82 | 92 | ~40 GB | 8.6 |
| Phi-4 | 14B | 16k | 84.8 | 82.6 | 91.8 | ~8.5 GB | 8.6 |
| Qwen3 32B | 32B | 33k | 83.4 | 82.9 | 94 | ~19 GB | 8.7 |
| Qwen3 14B | 14B | 33k | 77 | 80.5 | 88 | ~8 GB | 8.5 |
| Llama 3.1 8B | 8B | 131k | 69.4 | 72.1 | 84.5 | ~5.5 GB | 8.1 |
| Phi-4 Mini | 3B | 16k | 64 | 65 | 78 | ~2.5 GB | 7.5 |
| EXAONE 4.5 33B | 34.4B | 262k | - | - | - | ~20 GB | - |
| gemma-3-270m | 0.3B | 8k | - | - | - | - | - |
| Gemma 4 26B A4B | 26.5B | 262k | - | - | - | ~15 GB | - |
| Gemma 4 31B | 32.7B | 262k | - | - | - | ~19 GB | - |
| gpt-oss-120b | 120.4B | 131k | - | - | - | ~70 GB | - |
| gpt-oss-20b | 21.5B | 131k | - | - | - | ~12 GB | - |
| LFM2.5-1.2B-Instruct | 1.2B | 128k | - | - | - | ~1 GB | - |
| MiMo-V2.5-Pro | 1023.2B | 1049k | - | - | - | ~593 GB | - |
Showing 20 models, sorted by MMLU score; models without a verified score are listed last. Browse all 74 models or use our VRAM calculator for precise hardware sizing.
How to use this comparison
Find your model
Sort by benchmark (MMLU for reasoning, HumanEval for code) to find the strongest model in your target size class.
Check VRAM fit
Compare the Q4_K_M VRAM column with your GPU's memory: an RTX 4090 has 24 GB, an RTX 5090 has 32 GB, and a Mac Studio offers up to 192 GB of unified memory.
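The sort-then-check-fit steps can be sketched in a few lines of Python. This is only an illustration: the `Model` class, the `best_fit` helper, and the handful of rows copied from the table are assumptions made for the example, and the 1.5 GB headroom is the overhead figure quoted in the FAQ on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str
    mmlu: Optional[float]        # None where the table shows a dash
    vram_q4_gb: Optional[float]  # None where the table shows a dash


# A few rows copied from the table (illustrative subset, not the full list).
MODELS = [
    Model("DeepSeek V4 Pro", 90.1, None),
    Model("Qwen3.6 27B", 86.1, 16.0),
    Model("Llama 3.3 70B", 86.0, 40.0),
    Model("Phi-4", 84.8, 8.5),
    Model("Qwen3 32B", 83.4, 19.0),
    Model("Llama 3.1 8B", 69.4, 5.5),
    Model("gpt-oss-20b", None, 12.0),
]

OVERHEAD_GB = 1.5  # KV-cache and runtime headroom quoted in the FAQ


def best_fit(models: List[Model], gpu_vram_gb: float,
             overhead_gb: float = OVERHEAD_GB) -> List[Model]:
    """Return models that fit in gpu_vram_gb, strongest MMLU first.

    Models with a missing score or a missing VRAM figure are skipped,
    because they cannot be ranked or checked for fit.
    """
    fits = [
        m for m in models
        if m.mmlu is not None
        and m.vram_q4_gb is not None
        and m.vram_q4_gb + overhead_gb <= gpu_vram_gb
    ]
    return sorted(fits, key=lambda m: m.mmlu, reverse=True)


if __name__ == "__main__":
    for m in best_fit(MODELS, gpu_vram_gb=24.0):  # e.g. an RTX 4090
        print(f"{m.name}: MMLU {m.mmlu}, ~{m.vram_q4_gb} GB at Q4_K_M")
```

With a 24 GB card, the 70B model is filtered out and the remaining candidates are listed from highest to lowest MMLU.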
Click through
Each model links to its full page with VRAM tables, quantization guides, community quotes, and recommended hardware.
Frequently asked
Which benchmark should I care about most?
MMLU (Massive Multitask Language Understanding) measures general knowledge and reasoning across 57 subjects and is the most cited single metric. HumanEval measures code generation, GSM8K tests grade-school math reasoning, and MT-Bench scores conversational quality using an LLM as judge.
How accurate is the VRAM estimate?
VRAM at Q4_K_M is calculated from actual GGUF file sizes in llama.cpp. Real usage varies by batch size, context length, and inference engine. Add ~1.5 GB overhead for KV-cache and runtime. Use our VRAM calculator for precise estimates.
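To see why context length matters so much, here is a minimal sketch of the standard KV-cache size formula: two tensors (keys and values) per layer, each holding `n_kv_heads * head_dim` values per token. The function name is hypothetical, and the architecture numbers used for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) and the FP16 cache (2 bytes per value) are assumptions taken from the model's published configuration, not from this page.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Approximate KV-cache size in bytes.

    The leading factor of 2 accounts for one key tensor plus one value
    tensor per layer. bytes_per_value=2 assumes an FP16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_value * batch_size)


GIB = 1024 ** 3

# Assumed Llama 3.1 8B architecture: 32 layers, 8 KV heads, head_dim 128.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        size = kv_cache_bytes(context_tokens=ctx, **LLAMA_31_8B)
        print(f"{ctx:>7} tokens -> {size / GIB:.1f} GiB of KV cache")
```

Under these assumptions, an 8k-token context needs about 1 GiB of cache while the full 131k context needs about 16 GiB, so the flat ~1.5 GB allowance is best read as a short-context baseline.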
Why do some models not have benchmarks?
Models recently added may not have verified benchmark scores yet. We update benchmarks as they become available from the community. Scores marked with a dash will be filled in on the next content refresh.
What quantization should I use?
Q4_K_M is the consensus sweet spot for quality/size ratio. It preserves ~98% of FP16 quality while using only ~27% of the memory. Q5_K_M and Q8_0 offer marginal quality gains at significantly higher VRAM costs.
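As a rough worked example of the ~27% figure, FP16 stores 2 bytes per parameter, so a model with N billion parameters needs about 2N GB for weights at FP16 and roughly 0.27 times that at Q4_K_M. The helper names below are hypothetical, and the result is only a rule-of-thumb estimate; the table values come from actual file sizes and can differ.

```python
FP16_BYTES_PER_PARAM = 2   # 16 bits per weight
Q4_K_M_FRACTION = 0.27     # approximate share of FP16 memory, per the text above


def fp16_weights_gb(params_billion: float) -> float:
    """Weight memory at FP16 in decimal GB (1e9 params * 2 bytes = 2 GB)."""
    return params_billion * FP16_BYTES_PER_PARAM


def q4_k_m_weights_gb(params_billion: float) -> float:
    """Rule-of-thumb Q4_K_M weight memory, before KV-cache and runtime overhead."""
    return fp16_weights_gb(params_billion) * Q4_K_M_FRACTION


if __name__ == "__main__":
    for name, params in [("Llama 3.3 70B", 70), ("Phi-4", 14), ("Llama 3.1 8B", 8)]:
        print(f"{name}: FP16 ~{fp16_weights_gb(params):.0f} GB, "
              f"Q4_K_M ~{q4_k_m_weights_gb(params):.1f} GB")
```

For a 70B model this gives about 140 GB at FP16 and about 37.8 GB at Q4_K_M, in the same range as the ~40 GB listed in the table.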