LLM Comparison

Compare open-source LLMs side-by-side. Sort by benchmark score, VRAM requirements, context length, or parameter count. Click any model for full details.

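| Model | Params | Context (tokens) | MMLU | HumanEval | GSM8K | VRAM (Q4_K_M) | MT-Bench |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Pro | 1600B | 1049k | 90.1 | 76.8 | 92.6 | - | - |
| DeepSeek V4 Flash | 284B | 1049k | 88.7 | 69.5 | 90.8 | - | - |
| Kimi K2.6 | 1000B | 262k | 87.1 | - | 96.4 | - | - |
| Qwen3.6 27B | 28B | 262k | 86.1 | 82.6 | 93 | ~16 GB | 8.7 |
| GLM-5.1 | 744B | 2097k | 86 | - | 95.3 | - | - |
| Llama 3.3 70B | 70B | 131k | 86 | 78.4 | 92.1 | ~40 GB | 8.9 |
| Qwen2.5 72B | 72B | 131k | 85 | 82 | 92 | ~40 GB | 8.6 |
| Phi-4 | 14B | 16k | 84.8 | 82.6 | 91.8 | ~8.5 GB | 8.6 |
| Qwen3 32B | 32B | 33k | 83.4 | 82.9 | 94 | ~19 GB | 8.7 |
| Qwen3 14B | 14B | 33k | 77 | 80.5 | 88 | ~8 GB | 8.5 |
| Llama 3.1 8B | 8B | 131k | 69.4 | 72.1 | 84.5 | ~5.5 GB | 8.1 |
| Phi-4 Mini | 3B | 16k | 64 | 65 | 78 | ~2.5 GB | 7.5 |
| EXAONE 4.5 33B | 34.4B | 262k | - | - | - | ~20 GB | - |
| gemma-3-270m | 0.3B | 8k | - | - | - | - | - |
| Gemma 4 26B A4B | 26.5B | 262k | - | - | - | ~15 GB | - |
| Gemma 4 31B | 32.7B | 262k | - | - | - | ~19 GB | - |
| gpt-oss-120b | 120.4B | 131k | - | - | - | ~70 GB | - |
| gpt-oss-20b | 21.5B | 131k | - | - | - | ~12 GB | - |
| LFM2.5-1.2B-Instruct | 1.2B | 128k | - | - | - | ~1 GB | - |
| MiMo-V2.5-Pro | 1023.2B | 1049k | - | - | - | ~593 GB | - |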

Showing top 20 models by MMLU score. Browse all 74 models or use our VRAM calculator for precise hardware sizing.

How to use this comparison

1. Find your model. Sort by benchmark (MMLU for reasoning, HumanEval for code) to find the strongest model in your target size class.

2. Check VRAM fit. Look at the Q4_K_M VRAM column and compare it with your GPU's memory: RTX 4090 = 24 GB, RTX 5090 = 32 GB, Mac Studio = up to 192 GB unified.

3. Click through. Each model links to its full page with VRAM tables, quantization guides, community quotes, and recommended hardware.
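
The first two steps can also be sketched in a few lines of Python. This is a minimal illustration, not how the site itself works: the `Model` record and the `best_fit` helper are hypothetical names, the rows are a handful copied from the table above, and fit is judged by the table's Q4_K_M figure alone, ignoring runtime overhead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    mmlu: Optional[float]        # None where the table shows a dash
    humaneval: Optional[float]   # None where the table shows a dash
    vram_q4_gb: Optional[float]  # approximate Q4_K_M VRAM from the table


# A few rows copied from the comparison table (illustrative subset).
MODELS = [
    Model("Qwen3.6 27B", 86.1, 82.6, 16.0),
    Model("Llama 3.3 70B", 86.0, 78.4, 40.0),
    Model("Phi-4", 84.8, 82.6, 8.5),
    Model("Qwen3 32B", 83.4, 82.9, 19.0),
    Model("Llama 3.1 8B", 69.4, 72.1, 5.5),
    Model("gpt-oss-20b", None, None, 12.0),
]


def best_fit(models, gpu_vram_gb, metric="mmlu"):
    """Return the models that fit in gpu_vram_gb, strongest first by metric.

    Models with no score for the metric, or no VRAM figure, are skipped.
    """
    candidates = [
        m for m in models
        if m.vram_q4_gb is not None
        and m.vram_q4_gb <= gpu_vram_gb
        and getattr(m, metric) is not None
    ]
    return sorted(candidates, key=lambda m: getattr(m, metric), reverse=True)


if __name__ == "__main__":
    # Strongest coding models that fit a 24 GB card.
    for m in best_fit(MODELS, gpu_vram_gb=24, metric="humaneval"):
        print(f"{m.name}: HumanEval {m.humaneval}, ~{m.vram_q4_gb} GB")
```

Skipping models without a score keeps unbenchmarked entries from being ranked as if they had scored zero.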

Frequently asked

Which benchmark should I care about most?

MMLU (Massive Multitask Language Understanding) measures general knowledge and reasoning across 57 subjects, and it is the most cited single metric. HumanEval is the one to watch for code generation. GSM8K tests math reasoning. MT-Bench measures conversational quality using an LLM as judge.

How accurate is the VRAM estimate?

VRAM at Q4_K_M is calculated from actual GGUF file sizes in llama.cpp. Real usage varies by batch size, context length, and inference engine. Add ~1.5 GB overhead for KV-cache and runtime. Use our VRAM calculator for precise estimates.
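
As a worked example of the rule of thumb above, the sketch below adds the ~1.5 GB overhead to a table figure and reports how much memory is left on a given GPU. The function name and the flat overhead value are illustrative assumptions; actual KV-cache use depends on context length and batch size, so treat the result as a first check only.

```python
OVERHEAD_GB = 1.5  # flat allowance for KV-cache and runtime, per the text above


def headroom_gb(model_vram_q4_gb: float, gpu_vram_gb: float,
                overhead_gb: float = OVERHEAD_GB) -> float:
    """GPU memory left after the model plus overhead (negative means it does not fit)."""
    return gpu_vram_gb - (model_vram_q4_gb + overhead_gb)


if __name__ == "__main__":
    # Qwen3 32B (~19 GB in the table) on a 24 GB RTX 4090
    print(headroom_gb(19.0, 24.0))  # 3.5
    # Llama 3.3 70B (~40 GB in the table) on a 32 GB RTX 5090
    print(headroom_gb(40.0, 32.0))  # -9.5
```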

Why do some models not have benchmarks?

Models recently added may not have verified benchmark scores yet. We update benchmarks as they become available from the community. Scores marked with a dash will be filled in on the next content refresh.

What quantization should I use?

Q4_K_M is the consensus sweet spot for quality/size ratio. It preserves ~98% of FP16 quality while using only ~27% of the memory. Q5_K_M and Q8_0 offer marginal quality gains at significantly higher VRAM costs.
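
The ~27% figure can be turned into a back-of-the-envelope size estimate. The sketch below assumes FP16 weights take 2 bytes per parameter and applies the ratio quoted above; the function name is hypothetical, and the result covers weights only, not KV-cache or runtime overhead.

```python
FP16_BYTES_PER_PARAM = 2  # 16-bit weights
Q4_K_M_RATIO = 0.27       # ~27% of FP16 memory, per the text above


def approx_q4_k_m_gb(params_billions: float) -> float:
    """Rough Q4_K_M weight size in GB (1 GB = 1e9 bytes) from parameter count."""
    # Billions of parameters times bytes per parameter gives GB directly.
    fp16_gb = params_billions * FP16_BYTES_PER_PARAM
    return fp16_gb * Q4_K_M_RATIO


if __name__ == "__main__":
    for name, params in [("Llama 3.1 8B", 8), ("Qwen3 32B", 32), ("Llama 3.3 70B", 70)]:
        print(f"{name}: ~{approx_q4_k_m_gb(params):.1f} GB")
```

For these three models the estimate comes out at roughly 4.3, 17.3, and 37.8 GB, somewhat below the table's ~5.5, ~19, and ~40 GB, so the ratio is best read as approximate.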