Llama 3.3 70B
LlamaFeaturedLlama 3.3 Community Licensetext

Llama 3.3 70B

Updated Jun 7, 2026
Parameters
70B
Context
131,072
License
Llama 3.3 Community License
Updated
Jun 7, 2026

Intelligence benchmarks

Artificial Analysis indexes - compared with the best open and proprietary models

Intelligence

14.5

AA Index

Coding

10.7

AA Index

Agentic

9.1

AA Index

Math

7.7

AA Index

Intelligence Index - Llama 3.3 70B vs. the field

Best open-weight models (you can run locally) and leading proprietary models for context.

Claude Opus 4.8 (max)
61.4
closed
GPT-5.5 (xhigh)
60.2
closed
Claude Opus 4.7 (max)
57.3
closed
Gemini 3.1 Pro Preview
57.2
closed
Qwen3.7 Max
56.6
closed
Kimi K2.6
53.9
open
MiMo-V2.5-Pro
53.8
open
Llama 3.3 70B
14.5
open

Coding Index comparison

GPT-5.5 (xhigh)
59.1
closed
Claude Opus 4.8 (max)
56.7
closed
Gemini 3.1 Pro Preview
55.5
closed
Claude Opus 4.7 (Non-reasoning, high)
53.1
closed
GPT-5.3 Codex (xhigh)
53.1
closed
DeepSeek V4 Pro (Max)
47.5
open
Kimi K2.6
47.1
open
Llama 3.3 70B
10.7
open

Agentic Index comparison

Claude Opus 4.8 (max)
77.8
closed
GPT-5.5 (xhigh)
74.1
closed
Claude Opus 4.7 (max)
71.3
closed
Gemini 3.5 Flash (medium)
70.4
closed
MiniMax-M3
68.6
closed
MiMo-V2.5-Pro
67.4
open
DeepSeek V4 Pro (Max)
67.2
open
Llama 3.3 70B
9.1
open

Math Index comparison

Nova 2.0 Lite (high)
94.3
closed
gpt-oss-120b (high)
93.4
open
NVIDIA Nemotron 3 Nano
91
open
K-EXAONE
90.3
open
Nova 2.0 Omni (medium)
89.7
closed
gpt-oss-20B (high)
89.3
open
Nova 2.0 Pro Preview (medium)
89
closed
Llama 3.3 70B
7.7
open

Benchmark data from Artificial Analysis · updated 2026-06-07.

Standard benchmarks

Performance across standard evaluations

BenchmarkScore
MMLU86
HumanEval78.4
MT-Bench8.9
GSM8K92.1
MMLUPRO71.3
GPQA49.8
AIME30

Will it run on your hardware?

Pick your GPU memory - see which quantizations fit, and the cheapest card for the rest

Too big for 24 GB at any quant
0 of 4 quantizations fit Llama 3.3 70B with real runtime overhead.

Need an exact figure for your context length? Use the VRAM calculator.

Run it locally

Copy-paste - running in under a minute

Ollamaeasiest
ollama run llama3.3:70b
vLLMOpenAI-compatible API
vllm serve meta-llama/Llama-3.3-70B-Instruct

New to this? Start with Ollama · serve to many users with vLLM.

Deep dive

Notes, sources, and the full write-up

Llama 3.3 70B

Short answer: Llama 3.3 70B is Meta's instruction-tuned 70B model with 131k context, ~86 MMLU, and Apache-style permissive licensing. In mid-2026 it remains the default "I need maximum quality without going to a 200B+ model" pick. Needs 40 GB VRAM at Q4_K_M - practical setups are dual RTX 3090, an RTX 5090, or any Mac Studio M4 Ultra.

Why it's still on the shortlist in 2026

Despite Llama 4 and the GLM-5 series existing, Llama 3.3 70B is the model with:

  1. The most production deployments - every serving framework, every quant format, fully battle-tested
  2. The most agent-tuning data - toolformer/function-call fine-tunes are mature
  3. The most third-party fine-tunes on HuggingFace (coding, medical, legal verticals)
  4. The most permissive 70B license - Meta's Llama 3.3 license allows commercial use up to 700M MAU
  5. Best long-context behavior at 131k of any open 70B - needle-in-haystack stays >95% to 80k tokens

Benchmarks

BenchmarkScoreNotes
MMLU86.0Reasoning
HumanEval78.4Code generation
GSM8K92.1Math word problems
MT-Bench8.9Conversational quality

VRAM math

QuantVRAMPractical hardware
Q3_K_M~30 GBRTX 4090 (tight, ~12 tok/s), RTX 5090 (comfortable)
Q4_K_M~40 GBDual RTX 3090 / RTX 5090 + 8GB offload / Mac Studio M4 Ultra
Q5_K_M~48 GBDual RTX 4090 / Mac Studio M4 Ultra 64GB+
Q8_0~75 GBDual RTX 5090 / Mac Studio M4 Ultra 96GB+
FP16~140 GBServer territory

Run our VRAM calculator for your exact context length.

Real-world performance

Single-user, 2k context:

HardwareQuantTokens/sec
RTX 5090Q4_K_M (partial offload)~14 tok/s
Dual RTX 3090 (NVLink)Q4_K_M~16 tok/s
Dual RTX 4090Q5_K_M~22 tok/s
Mac Studio M4 Ultra 192GBQ4_K_M (MLX)~13 tok/s
Mac Studio M4 Ultra 192GBQ8_0 (MLX)~9 tok/s

How it compares

Llama 3.3 70BQwen3 30BMistral Small 3GLM-5 9B
MMLU86.083.475.278.1
HumanEval78.480.571.273.0
Q4 VRAM40 GB18 GB14 GB6 GB
Fits on 4090✗ (offload needed)
Best whentop quality, big VRAMbalanced prosumersmall + capablevery VRAM-constrained

Pick Llama 3.3 70B if you have ≥48 GB VRAM and want maximum open-weight quality. Pick Qwen3 30B if you have a single 24 GB GPU.

Frequently asked

Can I run it on a single RTX 4090?

Yes, at Q3_K_M with ~6 GB system RAM offload, ~12 tok/s. Usable but tight. For a 4090, Qwen3 30B is the better single-GPU pick.

What's better in 2026 - Llama 3.3 70B or Llama 4 Scout?

Llama 4 Scout (MoE 109B/17B active) is faster per-token but spikier in quality. Llama 3.3 70B is more consistent and has more ecosystem support. For most users in mid-2026, Llama 3.3 70B is still safer.

Is the 131k context actually usable?

Yes for retrieval/needle tasks up to ~80k tokens. Quality drift becomes noticeable past 100k. For reliable long-context, chunk and retrieve.

How to run

Ollama:

ollama run llama3.3:70b

vLLM (production):

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization awq --tensor-parallel-size 2

Frequently asked

Quick answers to common questions

How much VRAM does Llama 3.3 70B need?

Llama 3.3 70B with 70B parameters needs approximately 40 GB at Q4_K_M quantization. Use our VRAM calculator for an exact estimate.

Is Llama 3.3 70B better than other Llama models?

Llama 3.3 70B scores 86 on MMLU and 78.4 on HumanEval. It has 70B parameters with 131,072 context - a strong choice for reasoning, coding, general-chat.

What license is Llama 3.3 70B under?

Llama 3.3 70B is released under the Llama 3.3 Community License license, making it suitable for most commercial and personal projects.

What hardware runs Llama 3.3 70B well?

With 70B parameters, Llama 3.3 70B requires adequate VRAM. High-end GPUs like the RTX 4090 (24GB), RTX 5090 (32GB), or Mac Studio with unified memory are good options. Check our hardware directory for specific recommendations.

What is the best quantization for Llama 3.3 70B?

Q4_K_M is the recommended sweet spot - ~98% of FP16 quality at ~27% of the size. Q5_K_M (~48 GB) is an option if you have spare VRAM. Use our VRAM calculator to compare.

How long can Llama 3.3 70B's context window handle?

Llama 3.3 70B supports a 131,072-token context window - enough for very long documents, codebases, or multi-turn conversations. Real-world usable context may vary by implementation.

What models compete with Llama 3.3 70B?

Llama 3.3 70B competes with other 35B–105B. Browse our model directory for comparisons, benchmarks, and community reviews to find the best fit.

Compare & pair with

Similar models and recommended hardware

Nearby options

Similar models and compatible hardware by spec

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.