Llama 3.3 70B
Intelligence benchmarks
Artificial Analysis indexes - compared with the best open and proprietary models
Intelligence
14.5
AA Index
Coding
10.7
AA Index
Agentic
9.1
AA Index
Math
7.7
AA Index
Intelligence Index - Llama 3.3 70B vs. the field
Best open-weight models (you can run locally) and leading proprietary models for context.
Coding Index comparison
Agentic Index comparison
Math Index comparison
Benchmark data from Artificial Analysis · updated 2026-06-07.
Standard benchmarks
Performance across standard evaluations
| Benchmark | Score |
|---|---|
| MMLU | 86 |
| HumanEval | 78.4 |
| MT-Bench | 8.9 |
| GSM8K | 92.1 |
| MMLUPRO | 71.3 |
| GPQA | 49.8 |
| AIME | 30 |
Will it run on your hardware?
Pick your GPU memory - see which quantizations fit, and the cheapest card for the rest
Need an exact figure for your context length? Use the VRAM calculator.
Run it locally
Copy-paste - running in under a minute
ollama run llama3.3:70bvllm serve meta-llama/Llama-3.3-70B-InstructNew to this? Start with Ollama · serve to many users with vLLM.
Deep dive
Notes, sources, and the full write-up
Llama 3.3 70B
Short answer: Llama 3.3 70B is Meta's instruction-tuned 70B model with 131k context, ~86 MMLU, and Apache-style permissive licensing. In mid-2026 it remains the default "I need maximum quality without going to a 200B+ model" pick. Needs 40 GB VRAM at Q4_K_M - practical setups are dual RTX 3090, an RTX 5090, or any Mac Studio M4 Ultra.
Why it's still on the shortlist in 2026
Despite Llama 4 and the GLM-5 series existing, Llama 3.3 70B is the model with:
- The most production deployments - every serving framework, every quant format, fully battle-tested
- The most agent-tuning data - toolformer/function-call fine-tunes are mature
- The most third-party fine-tunes on HuggingFace (coding, medical, legal verticals)
- The most permissive 70B license - Meta's Llama 3.3 license allows commercial use up to 700M MAU
- Best long-context behavior at 131k of any open 70B - needle-in-haystack stays >95% to 80k tokens
Benchmarks
| Benchmark | Score | Notes |
|---|---|---|
| MMLU | 86.0 | Reasoning |
| HumanEval | 78.4 | Code generation |
| GSM8K | 92.1 | Math word problems |
| MT-Bench | 8.9 | Conversational quality |
VRAM math
| Quant | VRAM | Practical hardware |
|---|---|---|
| Q3_K_M | ~30 GB | RTX 4090 (tight, ~12 tok/s), RTX 5090 (comfortable) |
| Q4_K_M | ~40 GB | Dual RTX 3090 / RTX 5090 + 8GB offload / Mac Studio M4 Ultra |
| Q5_K_M | ~48 GB | Dual RTX 4090 / Mac Studio M4 Ultra 64GB+ |
| Q8_0 | ~75 GB | Dual RTX 5090 / Mac Studio M4 Ultra 96GB+ |
| FP16 | ~140 GB | Server territory |
Run our VRAM calculator for your exact context length.
Real-world performance
Single-user, 2k context:
| Hardware | Quant | Tokens/sec |
|---|---|---|
| RTX 5090 | Q4_K_M (partial offload) | ~14 tok/s |
| Dual RTX 3090 (NVLink) | Q4_K_M | ~16 tok/s |
| Dual RTX 4090 | Q5_K_M | ~22 tok/s |
| Mac Studio M4 Ultra 192GB | Q4_K_M (MLX) | ~13 tok/s |
| Mac Studio M4 Ultra 192GB | Q8_0 (MLX) | ~9 tok/s |
How it compares
| Llama 3.3 70B | Qwen3 30B | Mistral Small 3 | GLM-5 9B | |
|---|---|---|---|---|
| MMLU | 86.0 | 83.4 | 75.2 | 78.1 |
| HumanEval | 78.4 | 80.5 | 71.2 | 73.0 |
| Q4 VRAM | 40 GB | 18 GB | 14 GB | 6 GB |
| Fits on 4090 | ✗ (offload needed) | ✓ | ✓ | ✓ |
| Best when | top quality, big VRAM | balanced prosumer | small + capable | very VRAM-constrained |
Pick Llama 3.3 70B if you have ≥48 GB VRAM and want maximum open-weight quality. Pick Qwen3 30B if you have a single 24 GB GPU.
Frequently asked
Can I run it on a single RTX 4090?
Yes, at Q3_K_M with ~6 GB system RAM offload, ~12 tok/s. Usable but tight. For a 4090, Qwen3 30B is the better single-GPU pick.
What's better in 2026 - Llama 3.3 70B or Llama 4 Scout?
Llama 4 Scout (MoE 109B/17B active) is faster per-token but spikier in quality. Llama 3.3 70B is more consistent and has more ecosystem support. For most users in mid-2026, Llama 3.3 70B is still safer.
Is the 131k context actually usable?
Yes for retrieval/needle tasks up to ~80k tokens. Quality drift becomes noticeable past 100k. For reliable long-context, chunk and retrieve.
How to run
ollama run llama3.3:70bvLLM (production):
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--quantization awq --tensor-parallel-size 2Frequently asked
Quick answers to common questions
How much VRAM does Llama 3.3 70B need?
Llama 3.3 70B with 70B parameters needs approximately 40 GB at Q4_K_M quantization. Use our VRAM calculator for an exact estimate.
Is Llama 3.3 70B better than other Llama models?
Llama 3.3 70B scores 86 on MMLU and 78.4 on HumanEval. It has 70B parameters with 131,072 context - a strong choice for reasoning, coding, general-chat.
What license is Llama 3.3 70B under?
Llama 3.3 70B is released under the Llama 3.3 Community License license, making it suitable for most commercial and personal projects.
What hardware runs Llama 3.3 70B well?
With 70B parameters, Llama 3.3 70B requires adequate VRAM. High-end GPUs like the RTX 4090 (24GB), RTX 5090 (32GB), or Mac Studio with unified memory are good options. Check our hardware directory for specific recommendations.
What is the best quantization for Llama 3.3 70B?
Q4_K_M is the recommended sweet spot - ~98% of FP16 quality at ~27% of the size. Q5_K_M (~48 GB) is an option if you have spare VRAM. Use our VRAM calculator to compare.
How long can Llama 3.3 70B's context window handle?
Llama 3.3 70B supports a 131,072-token context window - enough for very long documents, codebases, or multi-turn conversations. Real-world usable context may vary by implementation.
What models compete with Llama 3.3 70B?
Llama 3.3 70B competes with other 35B–105B. Browse our model directory for comparisons, benchmarks, and community reviews to find the best fit.
Compare & pair with
Similar models and recommended hardware
Related models
Recommended hardware
Nearby options
Similar models and compatible hardware by spec
Comments coming soon
Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.