Llama 3.3
Meta · Llama family · Released December 2024
A single 70B model released December 2024, closing most of the gap to Llama 3.1 405B through improved post-training alone.
Model card
| Developer | Meta |
|---|---|
| Family | Llama |
| License | Llama 3.3 Community |
| Modality | text |
| Parameters (B) | 70 |
| Context window | 128000 |
| Release date | December 2024 |
| Primary languages | en,fr,de,es,it,pt,hi,th |
| Hugging Face | meta-llama/Llama-3.3-70B-Instruct |
| Ollama | ollama pull llama3.3 |
Llama 3.3 70B drops today: 405B performance at a fraction of the VRAM
Meta quietly released Llama 3.3 70B Instruct today — no big Connect keynote, no multimodal expansion, no new edge model. Just weights. And the claim is the interesting part: Meta says Llama 3.3 70B matches or exceeds Llama 3.1 405B on most standard benchmarks, at about a sixth of the parameters. For sovereign plebs who’ve been eyeing 405B but don’t have the Hashcenter to run it, that’s the news. The model card is live on Hugging Face.
What this tells you about where the open-weights space is going: the 2024 arms race is no longer “how many parameters can you train?” It’s “how efficient can you make a given parameter budget?” Today’s release is Meta’s strongest answer yet. Below: what they changed, what the benchmarks say at launch, and whether it’s worth pulling onto your rig tonight.
What’s in the weights
Llama 3.3 70B Instruct is a post-training refresh of the Llama 3.1 70B backbone. Meta did not retrain from scratch. What changed is the instruction-tuning pipeline — improved RLHF datasets, better reward modeling, and a new “offline” DPO (Direct Preference Optimization) stage that lets them sweep many preference variants without repeatedly running humans in the loop. The dense transformer architecture is unchanged from 3.1 70B: same 80 layers, same 8192 hidden dim, same SwiGLU, same RoPE, same GQA.
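Meta hasn’t published the exact recipe, but the DPO objective at the heart of that offline stage is simple enough to sketch in a few lines — here with made-up log-probabilities and a typical β, not Meta’s actual values:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are summed token log-probs of the chosen / rejected
    completion under the model being tuned; ref_logp_* are the same
    under the frozen reference (typically the SFT checkpoint).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Made-up numbers: when the tuned model prefers the chosen answer
# more strongly than the reference does, the margin is positive and
# the loss shrinks toward zero.
confident = dpo_loss(-10.0, -30.0, -20.0, -25.0)  # margin = +15
neutral = dpo_loss(-20.0, -20.0, -20.0, -20.0)    # margin = 0 → loss = ln 2
assert confident < neutral
```

The “offline” part is the point: the preference data is collected once, so Meta can re-run this objective across many candidate checkpoints without putting humans back in the loop each time.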
Credit the lineage: Transformer (2017, Vaswani et al.) → LLaMA 1 (February 2023) → Llama 2 (July 2023) → Llama 3 (April 2024) → Llama 3.1 (July 2024, which introduced the 405B flagship and 128K context) → Llama 3.2 (September 2024, vision and edge) → Llama 3.3 today. The post-training techniques owe a debt to Anthropic’s Constitutional AI, OpenAI’s InstructGPT, and the academic RLHF / DPO literature.
Key specs:
- 70B parameters, dense transformer (not MoE)
- 128K context window
- Text-only, English-first, with support for 8 official languages
- Function calling and tool use, improved over 3.1 70B
- Instruction-tuned only — no new base model shipped today
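The function calling works through the now-standard OpenAI-style JSON tool schema that runtimes like Ollama and vLLM accept. A minimal sketch — the tool name and shape here are invented for illustration, not from Meta’s docs:

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema.
# The tool name and behavior are made up for illustration.
get_block_height = {
    "type": "function",
    "function": {
        "name": "get_block_height",
        "description": "Return the current Bitcoin block height",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

# A well-behaved tool-use turn comes back as structured JSON naming
# the function to call, rather than free text:
model_turn = json.loads('{"name": "get_block_height", "arguments": {}}')
assert model_turn["name"] == get_block_height["function"]["name"]
```

The “improved over 3.1 70B” claim cashes out exactly here: fewer malformed tool-call JSON blobs, fewer turns where the model answers in prose when it should have called the function.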
Pretraining data is the same corpus as Llama 3.1 (15T tokens). This is a tuning release, not a scaling release. Whether you care about that depends on whether you think post-training has more headroom than pretraining does — and Meta’s betting that it does.
Benchmarks at release
From Meta’s model card:
- MMLU (5-shot): 3.3 70B at 86.0 vs 3.1 70B at 82.0 and 3.1 405B at 87.3 — 3.3 70B lands just over a point behind the 405B.
- GPQA Diamond (0-shot): 3.3 70B at 50.5 vs 3.1 405B at 50.7 — effectively tied on graduate-level science reasoning.
- HumanEval (code): 3.3 70B at 88.4 vs 3.1 405B at 89.0 — again effectively tied.
- MATH (0-shot CoT): 3.3 70B at 77.0 vs 3.1 405B at 73.8 — the 70B actually beats the 405B here, thanks to better math post-training.
- IFEval (instruction following): 3.3 70B leads 3.1 405B by a few points.
- Multilingual MGSM: 3.3 70B slightly behind 3.1 405B, but ahead of 3.1 70B.
The takeaway at release: on the tasks plebs actually use models for (code, math, instructions, general knowledge), Llama 3.3 70B is a 405B peer. On esoteric long-tail multilingual tasks, 405B still wins. The Open LLM Leaderboard and the LMSYS Chatbot Arena should have independent numbers within the week.
Sovereign pleb implications
This release is aimed squarely at the home-rig crowd. Running Llama 3.1 405B locally was, and remains, a nontrivial exercise: even at Q4 you’re looking at ~200GB of memory, which means a serious multi-GPU workstation or aggressive CPU offload through llama.cpp with a painful tokens-per-second hit. Llama 3.3 70B at Q4 is about 40GB — which means:
- Single RTX 3090 or 4090 (24GB): a 70B doesn’t fully fit — even Q3 quants run 30GB+. You’re into IQ2-class quants or partial CPU offload, and either way you take a real quality or speed hit.
- Dual RTX 3090 (48GB total): the sweet spot. Full Q4_K_M loads with headroom for long context (quantize the KV cache to stretch toward 32K). Roughly 15–25 tokens/sec depending on backend and batch size.
- Single H100 or A100 80GB: Q8 or fp16 shards, production-grade throughput. This is the small-Hashcenter tier.
- CPU + 64GB RAM: runs at Q4, 2–4 tokens/sec. Usable for background batch work, not interactive chat.
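Those VRAM figures fall straight out of the arithmetic. A rough estimator — the effective bits-per-weight numbers below are ballpark figures, not exact llama.cpp constants:

```python
def gguf_size_gb(params_b, bits_per_weight):
    """Rough in-memory size of a quantized model, in GB.

    bits_per_weight is the *effective* rate: K-quants store scales
    alongside the weights, so Q4_K_M lands near 4.8 bits, not 4.0.
    """
    return params_b * bits_per_weight / 8

for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("fp16", 16.0)]:
    print(f"{name:7s} ~{gguf_size_gb(70, bits):5.1f} GB")

# Q4_K_M on 70B comes out around 42 GB — hence the dual-3090 sweet
# spot — while 405B at the same quant needs roughly 240 GB.
```

Add a few GB on top for the KV cache and activations before deciding what fits.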
Check the GGUF quant guide for the Q4_K_M vs Q5_K_M vs Q6_K tradeoffs at this size — at 70B scale, the quality drop from Q5 to Q4 is noticeable on long-form writing and multi-step reasoning; Q3 and below start to hurt.
What this replaces in the daily stack: if you were running Llama 3.1 70B, 3.3 is a drop-in upgrade — same hardware, better quality. If you were running 3.1 405B on a big rig for “flagship-at-home” quality, you can now retire it to weekly-batch duty and run 3.3 70B for daily chat at a fraction of the power. If you were paying for a frontier API for coding, 3.3 70B is the first open model where the quality gap closes hard enough that dropping the API subscription is a real decision, not a compromise.
For plebs heating with inference, the thermal math matters: a dual-3090 rig pushing Llama 3.3 70B at sustained load dissipates ~700W — enough to meaningfully heat a small office in winter. If you’re running a decommissioned S19 as a heater, swapping the hashboard for a GPU tray and running 3.3 is the cleanest pivot you can make right now.
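The heater arithmetic, if you want to sanity-check it — 700W is an assumed sustained draw, not a measurement:

```python
# Essentially every watt an inference rig draws ends up as heat in
# the room, so converting to heater terms is one multiply each.
watts = 700                          # assumed sustained dual-3090 load
btu_per_hour = watts * 3.412         # 1 W = 3.412 BTU/hr
kwh_per_day = watts * 24 / 1000      # energy cost of running it 24/7
print(f"{btu_per_hour:.0f} BTU/hr, {kwh_per_day:.1f} kWh/day")
```

That lands near 2,400 BTU/hr — a small space heater on its low setting, except this one writes code while it runs.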
How to run it today
Llama 3.3 is live on the Ollama registry as of today:
ollama pull llama3.3:70b
That pulls the default Q4_K_M, about 40GB. New to Ollama? The 10-minute install guide walks through everything. Pair with Open WebUI for a clean chat interface that supports function calling, which 3.3 handles well.
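Once the pull finishes, Ollama serves an HTTP API on localhost:11434. A sketch of the request body for its /api/chat endpoint, assuming the default port and a non-streaming call:

```python
import json

# Request body for Ollama's /api/chat endpoint. Point urllib or curl
# at http://localhost:11434/api/chat once `ollama serve` is running.
payload = json.dumps({
    "model": "llama3.3:70b",
    "messages": [
        {"role": "user", "content": "Summarize what changed in Llama 3.3."}
    ],
    "stream": False,  # one JSON response instead of a token stream
})

# Terminal equivalent:
#   curl http://localhost:11434/api/chat -d "$PAYLOAD"
```

Same payload shape works for scripting batch jobs against the model overnight.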
LM Studio also has 3.3 70B available via its Hugging Face browser — look for the Bartowski or lmstudio-community GGUF quants. For anyone building custom quants, the fp16 weights are on the official Meta Llama HF page. If you hit VRAM issues on first load, our troubleshooting guide covers the usual offload and context-length knobs.
What comes next
No 3.3 405B. No 3.3 8B. This was a 70B-only refresh. The reasonable read is that Meta is consolidating post-training wins into the size class that matters most for open deployment — and saving the bigger architectural news for the next flagship release. Rumors around a Llama 4 with MoE architecture have been circulating; nothing official today.
For sovereign plebs, the news is uncomplicated: frontier-ish capability at a size that fits on two used 3090s, under a license that’s free for virtually every self-hoster (the Llama 3.3 Community License isn’t Apache — its restrictions just don’t bite below 700M monthly users), and same-day Ollama availability. Pull the weights, own the stack, close the API tab. See the Sovereign AI Manifesto for the broader case, and the pleb’s guide to self-hosted AI for the starting kit.
Benchmark history
Last benchmarked: December 6, 2024
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MMLU-Pro | 68.9 | vendor_blog ✓ | December 6, 2024 |
| MATH | 77.0 | vendor_blog ✓ | December 6, 2024 |
| GPQA | 50.5 | vendor_blog ✓ | December 6, 2024 |
| HumanEval | 88.4 | vendor_blog ✓ | December 6, 2024 |
| MMLU | 86.0 | vendor_blog ✓ | December 6, 2024 |
Recommended hardware
Needs dual 3090 / 4090 for Q4, or a single 48 GB card (A6000 / RTX 6000 Ada) for headroom.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
