LlamaFeaturedLlama 3.3 Community Licensetext

Llama 3.3 70B

Updated Jul 23, 2026

Parameters

70B

Context

131,072

License

Llama 3.3 Community License

Updated

Jul 23, 2026

Intelligence benchmarks

Artificial Analysis indexes - compared with the best open and proprietary models

Intelligence

9.4

AA Index

Coding

11.9

AA Index

Agentic

0.3

AA Index

Intelligence Index - Llama 3.3 70B vs. the field

Best open-weight models (you can run locally) and leading proprietary models for context.

Claude Fable 5 (with fallback)

59.9

closed

GPT-5.6 Sol (max)

58.9

closed

Kimi K3

57.1

closed

Claude Opus 4.8 (max)

55.7

closed

GPT-5.6 Terra (max)

closed

GLM-5.2

51.1

open

MiniMax-M3

44.4

open

Llama 3.3 70B

9.4

open

Coding Index comparison

GPT-5.6 Sol (xhigh)

78.3

closed

GPT-5.6 Terra (max)

76.7

closed

Claude Fable 5 (with fallback)

76.5

closed

Kimi K3

76.2

closed

Claude Opus 4.8 (max)

74.3

closed

GLM-5.2

68.8

open

Kimi K2.7 Code

60.8

open

Llama 3.3 70B

11.9

open

Agentic Index comparison

GPT-5.6 Sol (max)

closed

Claude Fable 5 (with fallback)

52.8

closed

Kimi K3

50.1

closed

GPT-5.6 Terra (max)

47.4

closed

Claude Opus 4.8 (max)

47.2

closed

GLM-5.2

43.1

open

DeepSeek V4 Pro

36.4

open

Llama 3.3 70B

0.3

open

Benchmark data from Artificial Analysis · updated 2026-07-23.

Standard benchmarks

Performance across standard evaluations

Benchmark	Score
MMLU	86
HumanEval	78.4
MT-Bench	8.9
GSM8K	92.1
MMLUPRO	71.3
GPQA	49.8
AIME	7.7

Will it run on your hardware?

Pick your GPU memory - see which quantizations fit, and the cheapest card for the rest

Too big for 24 GB at any quant

0 of 4 quantizations fit Llama 3.3 70B with real runtime overhead.

your 24 GB

Q4_K_M

40 GB

needs Apple MacBook Pro 16" M4 Max ($3,299)

Q5_K_M

48 GB

needs Mac Studio M4 Ultra ($4,699)

Q8_0

75 GB

needs Mac Studio M4 Ultra ($4,699)

FP16

140 GB

needs Mac Studio M4 Ultra ($4,699)

fits tight too big

Need an exact figure for your context length? Use the VRAM calculator.

Run it locally

Copy-paste - running in under a minute

Ollamaeasiest

ollama run llama3.3:70b

vLLMOpenAI-compatible API

vllm serve meta-llama/Llama-3.3-70B-Instruct

New to this? Start with Ollama · serve to many users with vLLM.

Deep dive

Notes, sources, and the full write-up

Llama 3.3 70B

Short answer: Llama 3.3 70B is Meta's instruction-tuned 70B model with 131k context, ~86 MMLU, and Apache-style permissive licensing. In mid-2026 it remains the default "I need maximum quality without going to a 200B+ model" pick. Needs 40 GB VRAM at Q4_K_M - practical setups are dual RTX 3090, an RTX 5090, or any Mac Studio M4 Ultra.

Why it's still on the shortlist in 2026

Despite Llama 4 and the GLM-5 series existing, Llama 3.3 70B is the model with:

The most production deployments - every serving framework, every quant format, fully battle-tested
The most agent-tuning data - toolformer/function-call fine-tunes are mature
The most third-party fine-tunes on HuggingFace (coding, medical, legal verticals)
The most permissive 70B license - Meta's Llama 3.3 license allows commercial use up to 700M MAU
Best long-context behavior at 131k of any open 70B - needle-in-haystack stays >95% to 80k tokens

Benchmarks

Benchmark	Score	Notes
MMLU	86.0	Reasoning
HumanEval	78.4	Code generation
GSM8K	92.1	Math word problems
MT-Bench	8.9	Conversational quality

VRAM math

Quant	VRAM	Practical hardware
Q3_K_M	~30 GB	RTX 4090 (tight, ~12 tok/s), RTX 5090 (comfortable)
Q4_K_M	~40 GB	Dual RTX 3090 / RTX 5090 + 8GB offload / Mac Studio M4 Ultra
Q5_K_M	~48 GB	Dual RTX 4090 / Mac Studio M4 Ultra 64GB+
Q8_0	~75 GB	Dual RTX 5090 / Mac Studio M4 Ultra 96GB+
FP16	~140 GB	Server territory

Run our VRAM calculator for your exact context length.

Real-world performance

Single-user, 2k context:

Hardware	Quant	Tokens/sec
RTX 5090	Q4_K_M (partial offload)	~14 tok/s
Dual RTX 3090 (NVLink)	Q4_K_M	~16 tok/s
Dual RTX 4090	Q5_K_M	~22 tok/s
Mac Studio M4 Ultra 192GB	Q4_K_M (MLX)	~13 tok/s
Mac Studio M4 Ultra 192GB	Q8_0 (MLX)	~9 tok/s

How it compares

	Llama 3.3 70B	Qwen3 30B	Mistral Small 3	GLM-5 9B
MMLU	86.0	83.4	75.2	78.1
HumanEval	78.4	80.5	71.2	73.0
Q4 VRAM	40 GB	18 GB	14 GB	6 GB
Fits on 4090	✗ (offload needed)	✓	✓	✓
Best when	top quality, big VRAM	balanced prosumer	small + capable	very VRAM-constrained

Pick Llama 3.3 70B if you have ≥48 GB VRAM and want maximum open-weight quality. Pick Qwen3 30B if you have a single 24 GB GPU.

Frequently asked

Can I run it on a single RTX 4090?

Yes, at Q3_K_M with ~6 GB system RAM offload, ~12 tok/s. Usable but tight. For a 4090, Qwen3 30B is the better single-GPU pick.

What's better in 2026 - Llama 3.3 70B or Llama 4 Scout?

Llama 4 Scout (MoE 109B/17B active) is faster per-token but spikier in quality. Llama 3.3 70B is more consistent and has more ecosystem support. For most users in mid-2026, Llama 3.3 70B is still safer.

Is the 131k context actually usable?

Yes for retrieval/needle tasks up to ~80k tokens. Quality drift becomes noticeable past 100k. For reliable long-context, chunk and retrieve.

How to run

Ollama:

ollama run llama3.3:70b

vLLM (production):

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization awq --tensor-parallel-size 2

Frequently asked

Quick answers to common questions

How much VRAM does Llama 3.3 70B need?

Llama 3.3 70B with 70B parameters needs approximately 40 GB at Q4_K_M quantization. Use our VRAM calculator for an exact estimate.

Is Llama 3.3 70B better than other Llama models?

Llama 3.3 70B scores 86 on MMLU and 78.4 on HumanEval. It has 70B parameters with 131,072 context - a strong choice for reasoning, coding, general-chat.

What license is Llama 3.3 70B under?

Llama 3.3 70B is released under the Llama 3.3 Community License license, making it suitable for most commercial and personal projects.

What hardware runs Llama 3.3 70B well?

With 70B parameters, Llama 3.3 70B requires adequate VRAM. High-end GPUs like the RTX 4090 (24GB), RTX 5090 (32GB), or Mac Studio with unified memory are good options. Check our hardware directory for specific recommendations.

What is the best quantization for Llama 3.3 70B?

Q4_K_M is the recommended sweet spot - ~98% of FP16 quality at ~27% of the size. Q5_K_M (~48 GB) is an option if you have spare VRAM. Use our VRAM calculator to compare.

How long can Llama 3.3 70B's context window handle?

Llama 3.3 70B supports a 131,072-token context window - enough for very long documents, codebases, or multi-turn conversations. Real-world usable context may vary by implementation.

What models compete with Llama 3.3 70B?

Llama 3.3 70B competes with other 35B–105B. Browse our model directory for comparisons, benchmarks, and community reviews to find the best fit.

Compare & pair with

Similar models and recommended hardware

Related models

qwen3-30b-a3b mistral-7b-instruct-v0-2

Recommended hardware

rtx-5090 rtx-3090 mac-studio-m4-ultra

Nearby options

Similar models and compatible hardware by spec

Fits on this hardware

Apple MacBook Pro 16" M4 Max - 48 GB VRAM NVIDIA RTX A6000 - 48 GB VRAM NVIDIA L40S - 48 GB VRAM NVIDIA RTX 6000 Ada Generation - 48 GB VRAM

Comments coming soon

Configure NEXT_PUBLIC_GISCUS_REPO_ID and NEXT_PUBLIC_GISCUS_CATEGORY_ID at giscus.app to enable.

Llama 3.3 70B

Intelligence benchmarks

Intelligence Index - Llama 3.3 70B vs. the field

Coding Index comparison

Agentic Index comparison

Standard benchmarks

Will it run on your hardware?

Run it locally

Deep dive

Llama 3.3 70B

Why it's still on the shortlist in 2026

Benchmarks

VRAM math

Real-world performance

How it compares

Frequently asked

Can I run it on a single RTX 4090?

What's better in 2026 - Llama 3.3 70B or Llama 4 Scout?

Is the 131k context actually usable?

How to run

Frequently asked

How much VRAM does Llama 3.3 70B need?

Is Llama 3.3 70B better than other Llama models?

What license is Llama 3.3 70B under?

What hardware runs Llama 3.3 70B well?

What is the best quantization for Llama 3.3 70B?

How long can Llama 3.3 70B's context window handle?

What models compete with Llama 3.3 70B?

Compare & pair with

Related models

Recommended hardware

Nearby options

Similar by size

Fits on this hardware