
Superseded

Llama 3.1

Meta · Llama family · Released July 2024

Meta's flagship 2024 open-weight LLM family — 8B, 70B, and 405B parameters with 128K context. The 405B was the first open-weight model at true frontier scale.

Model card

Developer: Meta
Family: Llama
License: Llama 3.1 Community
Modality: text
Parameters (B): 8, 70, 405
Context window: 128,000 tokens
Release date: July 2024
Primary languages: en, fr, de, es, it, pt, hi, th
Hugging Face: meta-llama/Llama-3.1-8B-Instruct
Ollama: ollama pull llama3.1

Meta dropped Llama 3.1 today, and the biggest open-weight model in the lineup—405B parameters—just put frontier-class capability in the hands of anyone with a GPU cluster and a git clone command. For sovereign plebs running inference at home, the real gift isn’t the 405B (you can’t run it without a Hashcenter full of H100s). It’s the updated 8B and 70B siblings, both bumped to a 128K context window and dramatically improved across reasoning, tool use, and multilingual capability.

This is the release that made "open source catches up to closed" more than a meme. The 70B matches or exceeds GPT-4 on most public benchmarks, according to Meta’s release numbers. The 8B finally has a context window large enough for real work. And every weight is under the Llama 3.1 Community License—commercial-use permissive for organizations under 700M monthly active users, which covers approximately every pleb reading this.

What’s in the weights

Llama 3.1 is the direct descendant of a research lineage worth acknowledging. The Transformer architecture from Google’s 2017 "Attention Is All You Need" paper. Meta’s original LLaMA from February 2023—leaked, then reluctantly open-sourced under Llama 2 in July 2023. Llama 3 in April 2024 with the tokenizer jump to 128K vocab. And now Llama 3.1, a refinement rather than a revolution, but a meaningful one.

Three sizes ship today:

  • Llama 3.1 8B: Same parameter count as Llama 3 8B, but retrained with the expanded 128K context and improved instruction-following. Runs on a single consumer GPU with 8GB+ VRAM at reasonable quants.
  • Llama 3.1 70B: The pleb flagship. 128K context, materially stronger reasoning than Llama 3 70B. Fits on dual RTX 3090s at Q4_K_M quantization.
  • Llama 3.1 405B: The headline act. First open-weight model to genuinely challenge GPT-4 and Claude 3.5 Sonnet on reasoning benchmarks, per Meta’s release post. Requires enterprise-grade inference infrastructure—not a home-pleb model.

Architecturally, it’s a standard decoder-only Transformer with Grouped Query Attention, SwiGLU activations, and RoPE positional embeddings extended via a custom scaling technique described in Meta’s technical blog post, enabling 128K context without catastrophic degradation. The 405B was trained on over 15 trillion tokens on Meta’s 16,000-H100 cluster—a compute budget that would bankrupt most nation-states.
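That 128K context has a real memory cost beyond the weights: the KV cache. A back-of-envelope sketch, using the 8B’s published config values (32 layers, 8 KV heads, head dim 128 — taken from the public model config; treat the result as an estimate, not a runner’s exact allocation):

```shell
# fp16 KV-cache size for Llama 3.1 8B at full 128K context.
layers=32; kv_heads=8; head_dim=128; bytes_fp16=2
# Each token stores one K and one V vector per layer, per KV head.
per_token=$((2 * layers * kv_heads * head_dim * bytes_fp16))
echo "${per_token} bytes/token"        # 131072 bytes = 128 KiB per token
ctx=131072                             # 128K tokens
echo "$((per_token * ctx / 1024 / 1024 / 1024)) GiB at full context"   # 16 GiB
```

This is why Grouped Query Attention matters for plebs: with 32 full KV heads instead of 8, the same cache would be 64 GiB. GQA cuts it 4x, which is the difference between "fits next to the weights" and "doesn’t fit at all".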

Training data details remain partially disclosed, as usual. Meta acknowledges a mix of publicly available text, code-heavy corpora, and synthetic data generated by earlier Llama checkpoints. Multilingual coverage expanded significantly from Llama 3, with stronger performance claimed for Spanish, Portuguese, German, French, Hindi, Italian, and Thai.

The 8B and 70B are distilled from the 405B teacher model, a first for the Llama family. Meta claims this transfer dramatically improved the smaller models’ capabilities without changing their parameter counts. For plebs, this means the 70B you download today is meaningfully stronger than Llama 3 70B, even though the sizes look identical on paper.

Benchmarks at release

Per Meta’s release blog, Llama 3.1 405B posts scores competitive with GPT-4o and Claude 3.5 Sonnet on the following benchmarks at release:

  • MMLU (general knowledge): 87.3 for 405B, 86.0 for 70B, 73.0 for 8B
  • HumanEval (code): 89.0 for 405B, 80.5 for 70B, 72.6 for 8B
  • MATH: 73.8 for 405B, 68.0 for 70B
  • GSM8K (grade-school math): 96.8 for 405B, 95.1 for 70B, 84.5 for 8B

Independent evaluators on the LMSys Chatbot Arena will take weeks to gather enough votes for a stable ranking, and community benchmarks on Hugging Face’s Open LLM Leaderboard will roll in over the next month. Treat Meta’s self-reported numbers with appropriate skepticism until the community reproduces them, but the architectural and training-scale improvements make the claims plausible.

For plebs running local inference, the most important benchmark isn’t a leaderboard—it’s whether the 70B at Q4 quantization feels sharp enough to replace your daily OpenAI habit. Early community reports suggest yes.

What it means for the sovereign pleb

The sovereign AI manifesto has been arguing that closed frontier labs will always be a rent-extracting bottleneck. Llama 3.1 70B is the first open-weight model that makes that argument tangible rather than aspirational. On dual RTX 3090s, a Q4_K_M quant of Llama 3.1 70B delivers 10–15 tokens per second and quality that covers roughly 80% of real pleb use cases—coding help, long-context summarization, research assistance, drafting. The remaining 20% (frontier reasoning, agentic tasks, multimodal) will still push you to closed models for now, but the gap has never been smaller.
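The 10–15 tokens per second figure is consistent with a simple rule of thumb: single-stream decode is memory-bandwidth bound, so tokens/sec tops out near GPU bandwidth divided by the bytes of weights read per token. A rough sketch, assuming the RTX 3090’s ~936 GB/s spec bandwidth and a ~40GB Q4 weight file:

```shell
# Theoretical decode ceiling: bandwidth (GB/s) / weight size (GB).
# Real throughput lands well under this once kernel overhead and
# dual-GPU pipelining are counted -- hence the observed 10-15 tok/s.
awk 'BEGIN { printf "%.1f tok/s ceiling\n", 936 / 40 }'   # ~23.4
```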

VRAM requirements at common quantization levels:

  • Llama 3.1 8B Q4_K_M: ~5GB VRAM — runs on an RTX 3060 12GB, a Mac M-series with 16GB unified memory, or any 8GB+ GPU with headroom to spare
  • Llama 3.1 8B Q8: ~9GB VRAM — near-FP16 quality on a 12GB card
  • Llama 3.1 70B Q4_K_M: ~40GB VRAM — dual 3090/4090 or a single A6000
  • Llama 3.1 70B Q5_K_M: ~49GB VRAM — pushes dual 3090 to the limit, prefer dual 4090 or A6000
  • Llama 3.1 70B Q8: ~75GB VRAM — quad 3090, dual A6000, or H100
  • Llama 3.1 405B Q4_K_M: ~240GB VRAM — not a pleb model. Eight H100s or a dedicated inference server.
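The figures above all follow one rule of thumb: weight size in GB ≈ parameters in billions × bits per weight ÷ 8, plus runtime overhead (KV cache, activations) on top. A minimal sketch — the ~4.8 bits-per-weight figure for Q4_K_M is a community rule of thumb, not an exact spec:

```shell
# Estimate GGUF weight size from parameter count and bits per weight.
estimate_weights_gb() {  # usage: estimate_weights_gb <params_B> <bits_per_weight>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
estimate_weights_gb 70 4.8    # ~42.0 GB  (close to the ~40GB quoted above)
estimate_weights_gb 8 4.8     # ~4.8 GB
estimate_weights_gb 405 4.8   # ~243.0 GB
```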

If you’re building a Hashcenter—recycling mining heat into compute work and selling the compute while you heat your home—the 70B is the sweet spot. It justifies the dual-GPU setup you’d buy anyway for Stable Diffusion and FLUX, and it replaces enough everyday ChatGPT tasks to make the sovereignty worth the electricity. For the used RTX 3090 stack we recommend for plebs, Llama 3.1 70B Q4 is the new default.

If you’re running a single-GPU rig, the 8B is now the default small model. It supplants Llama 3 8B, Mistral 7B, and most of the older "small-and-capable" tier in one release. For reference on which quant to grab, see our quantization explainer—Q4_K_M is still the right default for plebs who want quality-to-VRAM ratio, and Q8 is the right answer for people running 8B on a 24GB card who want near-FP16 fidelity.
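That selection logic can be sketched as a toy quant picker for the 8B — an illustration of the rule of thumb above, not an official tool; the thresholds just leave a few GB of headroom over the weight sizes listed earlier:

```shell
# Suggest an 8B quant for a given amount of free VRAM (GB).
pick_8b_quant() {  # usage: pick_8b_quant <vram_gb>
  if   [ "$1" -ge 12 ]; then echo "q8_0"      # ~9 GB weights + headroom
  elif [ "$1" -ge 8  ]; then echo "q4_K_M"    # ~5 GB weights + headroom
  else echo "offload to CPU or pick a smaller model"
  fi
}
pick_8b_quant 24   # -> q8_0
pick_8b_quant 8    # -> q4_K_M
```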

How to run it today

Quickstart with Ollama:

ollama pull llama3.1:8b

ollama pull llama3.1:70b

ollama run llama3.1:70b

Ollama pulls Q4_K_M by default, which is correct for most plebs. If you want a different quant, use the explicit tag:

ollama pull llama3.1:70b-instruct-q5_K_M

For the Hugging Face weights directly: meta-llama/Meta-Llama-3.1-70B-Instruct (you’ll need to accept the license on HF to pull the weights). The GGUF quants from community maintainers like bartowski tend to land within hours of release and are the pleb-preferred source for most local runners.

LM Studio users: the model should appear in the in-app search today or tomorrow. See our LM Studio vs Ollama vs llama.cpp comparison if you’re deciding which runner to use. For a browser chat UI on top of Ollama, Open WebUI remains the pleb standard.
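Front-ends like Open WebUI talk to Ollama over its local REST API on port 11434. A minimal non-streaming request looks like this (assumes `ollama serve` is running and the 8B is pulled; the fallback branch just reports if it isn’t):

```shell
# One-shot generation against Ollama's local REST API.
request='{"model": "llama3.1:8b", "prompt": "One-line summary of grouped query attention?", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$request" \
  || echo "no Ollama server on :11434"
```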

If inference heat is a feature, not a bug, for your setup, our heating-with-inference guide has the math on why a 70B on dual 3090s throws off enough heat to supplement a small room’s baseboard. And if you’re spinning up serious inference capacity from retired miners, the S19-to-AI-Hashcenter playbook is a better primer than anything Sam Altman will ever write. For broader market context on where this is all heading, the Hashcenter pivot thesis tracks the capital flows reshaping the industry.

What comes next

Meta has said explicitly that the next Llama will be multimodal—text, image, and potentially audio in a single model. Mark Zuckerberg’s open letter published alongside this release frames open-source AI as a "path forward" that Meta intends to keep investing in, citing the same Linux-vs-proprietary-Unix analogy that sovereign AI advocates have been making for a year. Take the corporate motives with a grain of salt, but the release cadence speaks for itself: Llama 1 to Llama 2 took five months; Llama 2 to Llama 3 took nine; Llama 3 to Llama 3.1 took three.

If you run into issues loading the weights or configuring your runtime, our self-hosted AI troubleshooting guide covers the common failure modes. For plebs who want to integrate the model into a local automation stack, the Home Assistant and Obsidian integration guide has patterns that work well with Llama 3.1 8B as the always-on classifier.

For plebs, the message today is simple. The 70B is the new daily driver. Pull the weights, spin up Ollama, and run your own inference in your own Hashcenter. The frontier labs can keep their API keys and their rate limits. Llama 3.1 is yours to run, to modify, and to keep.

Sovereignty was always going to be an open-weight game. Today it just got a lot more plausible.

Benchmark history

Last benchmarked: July 23, 2024

Benchmark    Score   Source        Measured
MMLU-Pro     73.3    vendor_blog   July 23, 2024
MATH         73.8    vendor_blog   July 23, 2024
GPQA         50.7    vendor_blog   July 23, 2024
HumanEval    89.0    vendor_blog   July 23, 2024
MMLU         87.3    vendor_blog   July 23, 2024

Recommended hardware

Multi-GPU rig or cloud territory. For most plebs, the 70B distillation is plenty.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

     Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

     Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

     GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.