

Llama 4 (Scout/Maverick)

Meta · Llama family · Released April 2025

Meta's April 2025 MoE-and-multimodal release, headlined by Scout's 10M-token context window and the pre-announced Behemoth frontier model.

Model card

Developer: Meta
Family: Llama
License: Llama 4 Community
Modality: text + vision
Parameters (B): varies (MoE)
Context window: 10,000,000
Release date: April 2025
Primary languages: en, fr, de, es, it, pt, hi, th, ar
Hugging Face: meta-llama/Llama-4-Scout-17B-16E-Instruct
Ollama: ollama pull llama4

Llama 4 drops today: Meta goes MoE, native multimodal, and Scout fits on a single H100

Meta just released Llama 4, and the lineage took a hard left. For the first time in the flagship Llama line, the weights are a Mixture-of-Experts (MoE) architecture, not a dense transformer. Two models ship today: Llama 4 Scout (17B active, 109B total, 16 experts) and Llama 4 Maverick (17B active, 400B total, 128 experts). A third, Behemoth, is still training and will drop later. Both Scout and Maverick are natively multimodal (text + image) out of the box, and both are available today on llama.com and Hugging Face.

If you were expecting another dense 70B like Llama 3.3, re-calibrate. The new design question is no longer “how big is the model?” — it’s “how much of the model actually fires per token?” For plebs running GPUs at home, that shift matters more than any benchmark headline, because MoE changes the VRAM math. We’ll unpack all of it below: what Meta shipped, what the numbers look like at launch, and what a sovereign stack looks like when the flagship open model is suddenly 109B parameters with a 10M context window. This is release-day analysis — official numbers only, same-day availability, same-day running.

What’s in the weights

Llama 4 is Meta’s first production deployment of Mixture-of-Experts in the open flagship line. Credit where it’s due: the MoE idea isn’t new. Google’s Switch Transformer (2021), Mistral’s Mixtral 8x7B (December 2023), and DeepSeek’s DeepSeek V3 (December 2024) all proved sparse expert routing at scale. What’s different today is that Meta — the single largest open-weights distributor on earth — has made MoE the default shape of Llama going forward. That’s the lineage moving: Transformer (2017) → LLaMA 1 (2023) → Llama 2 → Llama 3 / 3.1 / 3.2 / 3.3 → Llama 4.

Scout (17B active / 109B total)

  • 16 routed experts; each token hits one routed expert plus a shared expert (two experts active per token)
  • 17B active parameters at inference
  • 10M token context window (yes, ten million — Meta’s headline number)
  • Natively multimodal: early-fusion vision + text training
  • Fits on a single NVIDIA H100 at Int4 quantization, per Meta’s model card

Maverick (17B active / 400B total)

  • 128 routed experts; each token hits one routed expert plus a shared expert (two experts active per token)
  • 17B active parameters at inference (same as Scout)
  • 1M token context window
  • Natively multimodal
  • Targets a single H100 host with expert sharding

Both models were pre-trained on roughly 30 trillion tokens of mixed text, code, and image data — more than double Llama 3’s training corpus. Meta is positioning Maverick as a GPT-4o / Claude 3.7 / Gemini 2.0 peer, with Scout as the “edge flagship” for local inference. The 10M context in Scout is the number everyone’s going to be talking about tonight — Meta says it’s achieved with an iRoPE (interleaved RoPE) architecture change. Whether it holds up under retrieval stress tests is something the community will pressure-test over the next few weeks.
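The routing scheme above is easier to see in code than in prose. Here's a toy NumPy sketch of a shared-plus-one-routed MoE layer — illustrative only, not Meta's implementation; the expert and router weights are random stand-ins:

```python
import numpy as np

def moe_layer(x, routed_experts, shared_expert, router_w):
    """Toy MoE forward pass: every token goes through the shared expert,
    plus exactly one routed expert picked by an argmax router.
    Illustrative sketch only -- not Meta's actual implementation."""
    logits = x @ router_w                # (tokens, n_experts) router scores
    choice = logits.argmax(axis=-1)      # top-1 routed expert per token
    out = shared_expert(x)               # shared expert fires for every token
    for e in np.unique(choice):
        mask = choice == e
        out[mask] += routed_experts[e](x[mask])  # only the chosen expert fires
    return out, choice

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 16, 5        # Scout-shaped: 16 routed experts
make_expert = lambda W: (lambda x: x @ W)
experts = [make_expert(rng.normal(size=(d, d))) for _ in range(n_experts)]
shared = make_expert(rng.normal(size=(d, d)))
router = rng.normal(size=(d, n_experts))
x = rng.normal(size=(n_tokens, d))

y, chosen = moe_layer(x, experts, shared, router)
print(chosen)  # one routed-expert index per token
```

The point: all 16 expert weight matrices must be loaded (the router can pick any of them), but each token only pays the compute cost of two. That's the whole MoE bargain in five lines.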

Benchmarks at release

These are the numbers Meta published today in the release blog. We’re not adding speculation, and we’re not waiting for lmsys — we’re reading what the creator shipped with the weights.

  • Maverick vs GPT-4o / Gemini 2.0 Flash: Meta claims Maverick beats both on most reasoning and coding benchmarks, and matches the newer DeepSeek V3.1 on the same tasks — at roughly half the active parameters.
  • Scout vs Llama 3.3 70B: Meta claims Scout matches or exceeds 3.3 70B on the standard MMLU / GSM8K / HumanEval suite while running at ~5x the tokens/sec thanks to the 17B active path.
  • STEM benchmarks: on MATH and GPQA, Scout’s 17B active path punches well above its weight class, and Meta’s numbers put the still-training Behemoth ahead of GPT-4.5 preview on several STEM tasks.
  • Multilingual: Llama 4 was pre-trained on 200 languages, with full support for 12. Meta’s MMLU-Pro multilingual results put it in the same bracket as Qwen 2.5 72B, which had been the open multilingual leader.

As always with release-day numbers: the creator picked the benchmarks, so read them as “best-case officially-endorsed.” The lmsys arena and the Hugging Face Open LLM Leaderboard will sort out the real ranking over the next 30 days. But the shape of the claim — MoE matching dense frontier models at a fraction of active compute — is consistent with what Mixtral showed in 2023 and DeepSeek V3 showed in December. The architecture works.

Sovereign pleb implications

Here’s where the rubber meets the road for a Hashcenter running Llama locally. MoE breaks the old VRAM rule of thumb. Under a dense model like Llama 3.3 70B, you needed to hold all 70B parameters in VRAM — roughly 40GB at Q4, meaning a pair of used RTX 3090s (48GB combined) was the minimum sovereign rig. Under Llama 4 Scout, you need to hold all 109B parameters resident (because the router can pick any expert per token), but only 17B are active per forward pass. The VRAM bill is 109B total; the compute bill is 17B active.

Translation for the typical pleb rig:

  • Scout at Q4_K_M (GGUF): roughly 60–65GB on disk, so you need a minimum of two 3090s plus ~16GB of system RAM spillover, or a single H100/A100 80GB. This is the new “comfortable local flagship” tier.
  • Scout at Q2/Q3: gets you down into the 40GB range, runnable on a dual-3090 rig comfortably, at the cost of noticeable quality degradation. See our GGUF quant guide for the tradeoffs.
  • Maverick: 400B total parameters means this is not a home-rig model unless you have a multi-GPU workstation or you’re running a small Hashcenter. Realistically, Maverick is an 8x H100 deployment or a heavily offloaded llama.cpp build.
  • Behemoth (when it drops): likely Hashcenter-only.
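The tier list above is just bits-per-weight arithmetic. A quick sizing helper makes the math checkable — the effective bpw figures (Q4_K_M ≈ 4.8, Q3-class ≈ 3.0) are assumed averages, and real GGUF files vary by a few GB either way:

```python
def quant_size_gb(total_params_b, bits_per_weight):
    """Back-of-envelope resident size for a quantized model.
    bits_per_weight is an effective average (Q4_K_M lands near 4.8 bpw
    once higher-precision embedding/output layers are averaged in).
    total_params_b is in billions, result is in GB."""
    return total_params_b * bits_per_weight / 8

# MoE changes compute, not residency: all 109B of Scout must be loaded.
print(f"Scout Q4_K_M : ~{quant_size_gb(109, 4.8):.0f} GB")  # dual-3090 + RAM-spill tier
print(f"Scout Q3-ish : ~{quant_size_gb(109, 3.0):.0f} GB")  # the ~40 GB tier
print(f"Scout Int4   : ~{quant_size_gb(109, 4.0):.0f} GB")  # under one 80 GB H100
print(f"Maverick Q4  : ~{quant_size_gb(400, 4.8):.0f} GB")  # not a home rig
```

Run the same function against your own GPU budget before committing to a download — the compute bill is 17B active, but the residency bill is always the total parameter count.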

What does Llama 4 Scout replace in the daily stack? For plebs who had been running Llama 3.3 70B on dual 3090s for chat + coding, Scout is a direct upgrade: same hardware, more capability, plus native vision. For plebs running Qwen 2.5 72B as their multilingual daily driver, Scout is worth a head-to-head trial — the multilingual claim is the first real open challenge to Qwen’s territory. For plebs running Llama 3.1 8B for fast local chat, Scout is probably overkill; stay on 8B and watch for a Scout-mini variant.

The Hashcenter pivot crowd will care about two numbers: Maverick’s claimed parity with GPT-4o at 17B active, and the 10M context on Scout. If those hold up in production, open weights just closed a meaningful gap with the frontier APIs — and the economics of running your own inference stack (on decommissioned S19 fleets or a small GPU build) get a lot more interesting. See our write-up on converting mining sites to AI inference for the power-envelope math.

How to run it today

Scout hits the Ollama registry at release. For the typical pleb workflow:

ollama pull llama4:scout

That pulls the default Q4_K_M quantization — roughly 65GB download, ready to run. If you haven’t set up Ollama yet, our 10-minute Ollama install guide walks through the full process, including GPU detection and VRAM offload tuning. For a nicer chat interface, pair it with Open WebUI — the vision support in Scout lands cleanly in that UI as of today.
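Once the pull finishes, Ollama serves a local REST API on port 11434, so you can script against Scout without touching the CLI. A minimal Python sketch — the `llama4:scout` tag is assumed from the pull command above, and the endpoint shape is Ollama's standard `/api/generate`:

```python
import json
import urllib.request

def build_request(prompt, model="llama4:scout"):
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt, model="llama4:scout", host="http://localhost:11434"):
    """One-shot generation against a local Ollama server."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# usage, once `ollama serve` is up and the pull has finished:
#   print(ask("Summarize iRoPE in one sentence."))
```

Nothing leaves your machine: the request goes to localhost, the weights stay on your disk, and there's no API key because there's no API company.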

For plebs who prefer a GUI inference loader, LM Studio is already pulling Llama 4 GGUFs from Hugging Face — the 109B Scout GGUFs are being uploaded by the community right now. Give it a few hours for the Q4 quants to settle. The Meta Llama org on Hugging Face has the official fp16 weights for anyone building custom quantizations. If inference crashes on first load, check our self-hosted AI troubleshooting guide — the usual suspects are CUDA version and VRAM overcommit, and Llama 4’s MoE routing adds one new failure mode (missing expert weights, usually a partial download).

What comes next

Meta pre-announced Behemoth today — a 2T-parameter MoE still in training — but gave no release window. Expect Scout and Maverick derivatives (instruct tunes, tool-calling variants, distills) to flood Hugging Face over the next two weeks; the community always iterates fast on a fresh Llama drop. The interesting open question is whether the 10M context holds up in practice: Scout’s iRoPE design is new, and retrieval benchmarks at that length have historically exposed positional encoding weaknesses. We’ll know in a week.

Bigger picture: with Llama 4, Meta formally moved open-weights from “dense transformers” to “sparse MoE” as the default flagship shape. That’s a lineage shift. Mixtral proved it, DeepSeek scaled it, Meta just made it the mainstream open default. Sovereign plebs win when frontier architectures ship with permissive licenses and land on the Ollama registry the same day — and today, that’s what happened. Pull it, run it, own your inference. See the Sovereign AI Manifesto for the why, and the pleb’s guide to self-hosted AI for the how.

Benchmark history

Last benchmarked: April 5, 2025 (needs refresh)

Benchmark | Score | Source      | Measured
MMLU-Pro  | 80.5  | vendor_blog | ✓ April 5, 2025
GPQA      | 69.8  | vendor_blog | ✓ April 5, 2025

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.