Mixtral 8x7B
Mistral AI · Mistral family · Released December 2023
Mistral AI's December 2023 mixture-of-experts model — 8 experts, 2 active per token, Apache 2.0, running at Llama 2 13B speed with Llama 2 70B quality.
Model card
| Developer | Mistral AI |
|---|---|
| Family | Mistral |
| License | Apache-2.0 |
| Modality | text |
| Parameters (B) | 46.7 (MoE) |
| Context window | 32768 |
| Release date | December 2023 |
| Primary languages | en, fr, de, es, it |
| Hugging Face | mistralai/Mixtral-8x7B-Instruct-v0.1 |
| Ollama | ollama pull mixtral |
Mixtral 8x7B: Mistral ships the first seriously usable open Mixture-of-Experts model
Mistral AI just released Mixtral 8x7B — a Sparse Mixture-of-Experts (SMoE) model with 46.7 billion total parameters but only 12.9 billion active per token — under the Apache 2.0 license. The announcement went up this morning and the weights are live on Hugging Face. Three months after Mistral 7B beat Llama 2 13B at roughly half the size, Mistral is back with a bigger architectural bet: sparse expert routing.
Mixture-of-Experts isn’t a new idea — Google’s Switch Transformer (2021), GShard (2020), and the academic MoE literature go back further — but this is the first time an MoE model at this scale has shipped with open weights on a permissive license. Mistral is claiming Mixtral outperforms Llama 2 70B on most benchmarks at roughly 6x faster inference. If that holds up, the open-weights landscape just shifted: a home-rig pleb can reach for 70B-class quality without paying 70B-class compute. Below: how MoE works, what the numbers say at release, and what this means for sovereign local inference.
What’s in the weights
Mixtral 8x7B is a decoder-only transformer where the dense feedforward layers have been replaced by Mixture-of-Experts layers. In each MoE layer, a small “router” network reads the incoming token and picks the top 2 of 8 expert feedforward sub-networks to process it. Only those 2 experts fire; the other 6 are skipped entirely for that token. Different tokens can route to different experts.
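The routing step can be sketched in a few lines of numpy. This is a toy illustration of top-2 routing, not Mixtral's actual implementation: the expert "MLPs" here are stand-in linear maps, and the dimensions are made up.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Toy top-k MoE layer for a single token vector x.

    router_w: (d_model, n_experts) router weights
    experts:  list of callables, one feedforward sub-network per expert
    """
    logits = x @ router_w                 # router scores, one per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top-2 experts
    gates = np.exp(logits[top])           # softmax over the SELECTED experts
    gates /= gates.sum()                  # only, as in the SMoE formulation
    # Only the chosen experts run; the other 6 are skipped for this token.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_exp = 16, 8
# Stand-in "experts": plain linear maps instead of real MLP blocks.
weights = [rng.normal(size=(d, d)) for _ in range(n_exp)]
experts = [lambda x, W=W: x @ W for W in weights]
router_w = rng.normal(size=(d, n_exp))
y = moe_layer(rng.normal(size=d), router_w, experts)
```

Because the router scores depend on the token, different tokens light up different expert pairs, which is exactly why all 8 experts must stay resident in memory.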
The lineage: Transformer (2017) → Switch Transformer and GShard (Google, 2020–2021) → LLaMA 1 and 2 (Meta, 2023) → Mistral 7B (September 2023) → Mixtral today. Mistral credits the MoE research at Google and the open implementations in the academic community as prior art. What’s new is the engineering: a small lab shipping a production-quality SMoE at 46.7B total parameters on a permissive license.
Key specs:
- 46.7B total parameters, 12.9B active per token (8 experts, top-2 routing)
- Base architecture: 32 transformer layers, decoder-only, with MoE feedforward blocks
- Same Grouped-Query Attention as Mistral 7B (the 32K context is attended densely, without Mistral 7B's sliding window)
- 32,000-token context window (a jump from Mistral 7B’s 8K)
- Tokenizer: SentencePiece BPE, 32K vocabulary
- Multilingual: strong in English, French, German, Spanish, Italian
- License: Apache 2.0 — permissive, commercial OK, no user-count clause
The practical mental model for MoE: you pay 46.7B parameters in VRAM and disk (because the router can pick any expert combination per token), but you pay 12.9B parameters in active compute per forward pass. VRAM scales with total params; inference speed scales with active params. That’s the key economic insight that makes MoE interesting for sovereign plebs.
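That mental model is just arithmetic, and it's worth doing once. A quick sketch, using the standard rough rule of ~2 FLOPs per active parameter per generated token (an approximation, not a vendor figure):

```python
def weights_gib(params_b, bytes_per_param):
    """Weight footprint in GiB: parameters (billions) x bytes each."""
    return params_b * 1e9 * bytes_per_param / 2**30

def flops_per_token(active_params_b):
    """Rough decode cost: ~2 FLOPs per ACTIVE parameter per token."""
    return 2 * active_params_b * 1e9

# Memory is paid on TOTAL params; per-token compute is paid on ACTIVE params.
mem_fp16 = weights_gib(46.7, 2.0)   # ~87 GiB just to hold fp16 weights
compute = flops_per_token(12.9)     # ~25.8 GFLOPs per generated token
```

A dense 47B model would cost the same memory but nearly 4x the per-token compute, which is the whole MoE trade in one line.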
Benchmarks at release
From Mistral’s release blog, published today:
- MMLU: Mixtral 8x7B at 70.6 vs Llama 2 70B at 69.9 and GPT-3.5 at 70.0 — Mixtral edges out both at a fraction of the active compute.
- HellaSwag (10-shot): Mixtral at 86.7 vs Llama 2 70B at 87.1 — effectively tied.
- WinoGrande (5-shot): Mixtral at 81.2 vs Llama 2 70B at 83.2 — slight Llama edge.
- HumanEval (code, 0-shot pass@1): Mixtral at 40.2 vs Llama 2 70B at 29.9 — Mixtral is decisively ahead on code.
- MATH: Mixtral at 28.4 vs Llama 2 70B at 13.8 — another large Mixtral lead.
- GSM8K (maj@8): Mixtral at 74.4 vs Llama 2 70B at 54.9 — Mixtral ahead by nearly 20 points on grade-school math.
- Multilingual MMLU: Mixtral ahead on French, German, Spanish, Italian — the multilingual training shows.
Mistral is also claiming Mixtral matches or beats GPT-3.5 Turbo on most of these benchmarks. That’s the real headline: an open Apache 2.0 model is claiming parity with a closed flagship API. Expect the Open LLM Leaderboard and the LMSYS Chatbot Arena to validate or correct over the coming weeks — but the shape of the claim is credible given the architecture.
Sovereign pleb implications
This release changes what “local 70B-class model” means. Before today, running Llama 2 70B on home hardware meant ~40GB at Q4 — dual 3090s, minimum. Mixtral changes the math:
- Mixtral 8x7B at fp16: about 95GB. Multi-GPU or aggressive CPU offload territory — not a single-card model at full precision.
- Mixtral 8x7B at Q8: about 50GB. Doable on a dual-3090 rig (48GB) with light offload, clean on an 80GB A100.
- Mixtral 8x7B at Q5_K_M (GGUF): about 32GB. Fits on dual 3090s with room for long context. This is the sweet spot for home rigs.
- Mixtral 8x7B at Q4_K_M: about 27GB. Comfortable on dual 3090s even with generous context. Workable on a single 32GB card, or a single 24GB card with offload to system RAM.
- Mixtral 8x7B at Q3: about 20GB. Single-3090 territory with noticeable quality degradation — see our quant guide for the tradeoffs.
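You can sanity-check these sizes yourself from bits-per-weight. The figures below are rough effective bits-per-weight for llama.cpp K-quants (approximations from community GGUF files, not spec numbers), with ~5% assumed metadata overhead:

```python
def quant_gib(total_params_b, bits_per_weight, overhead=1.05):
    """Approximate GGUF file size in GiB: params x bits/8, plus overhead.

    bits_per_weight values are rough effective averages for K-quant
    schemes, not exact format constants.
    """
    return total_params_b * 1e9 * bits_per_weight / 8 / 2**30 * overhead

# Mixtral's TOTAL parameter count drives the file size, not the active 12.9B.
for name, bpw in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    print(f"{name}: ~{quant_gib(46.7, bpw):.0f} GiB")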
The speed side of the MoE equation is where it gets interesting. Because only 12.9B parameters are active per token, inference runs close to 13B-class speed — not 47B-class. On a dual-3090 rig running Mixtral at Q5, plebs are reporting 30–50 tokens/sec, well above what Llama 2 70B delivers on the same hardware.
What this replaces in a pleb stack: if you were running Llama 2 70B locally, Mixtral is a quality-and-speed upgrade at similar VRAM cost. If you were paying for GPT-3.5 via API, the open-weights alternative that matches it on benchmarks is on your disk today. If you were running Mistral 7B as a fast daily driver, Mixtral is the reasoning upgrade when your rig can afford the VRAM.
For inference heaters, Mixtral’s active-parameter economics are ideal: a dual-3090 rig under Mixtral load pulls roughly the same wattage as under Llama 2 70B load (~700W combined), but delivers more tokens of useful work per kWh. For Hashcenter operators considering inference workloads, MoE is the architecture that makes serving open models at scale economically competitive with closed APIs — because you pay total-param memory once and active-param compute per request.
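The tokens-per-kWh claim is plain arithmetic. The 700W draw and 30–50 tok/s figures come from the paragraph above; the dense-70B rate below is an assumed illustrative number, not a measurement:

```python
def tokens_per_kwh(tokens_per_sec, watts):
    """Useful work per unit energy: tokens/sec divided by kW, times 3600 s/h."""
    return tokens_per_sec / (watts / 1000.0) * 3600

mixtral_tpk = tokens_per_kwh(40, 700)   # mid-range of the reported 30-50 tok/s
llama70_tpk = tokens_per_kwh(12, 700)   # ASSUMED dense-70B rate, same rig
```

Same wattage, roughly 3x the tokens per kilowatt-hour under these assumptions — that ratio, not the absolute numbers, is the point.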
How to run it today
Mixtral 8x7B is available today on Hugging Face:
- Base model: mistralai/Mixtral-8x7B-v0.1
- Instruct: mistralai/Mixtral-8x7B-Instruct-v0.1
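The Instruct model expects the Mistral `[INST]` chat template. Here is a minimal formatter sketched from the Mistral instruct model cards — note that the BOS token `<s>` is normally added by the tokenizer, not the prompt string:

```python
def mixtral_prompt(turns):
    """Format (user, assistant) turns in the Mistral-instruct style.

    Sketch of the [INST] template from the Mistral instruct model cards.
    Completed assistant turns are closed with </s>; pass assistant=None
    to leave the final user turn open for the model to complete.
    """
    parts = []
    for user, assistant in turns:
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            parts.append(f" {assistant}</s>")
    return "".join(parts)

p = mixtral_prompt([("Explain top-2 routing in one sentence.", None)])
```

In practice, the Hugging Face tokenizer's `apply_chat_template` produces this formatting for you from a list of role/content messages, so hand-rolling it is mostly useful for understanding what hits the model.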
MoE support in llama.cpp is hot off the press — Georgi Gerganov’s team merged the Mixtral kernels today, and community GGUF quantizations are being uploaded to Hugging Face right now. Expect an Ollama registry entry within 24–48 hours; once live, ollama pull mixtral will be the command. In the meantime, our 10-minute Ollama install guide gets you ready. For a chat UI, Open WebUI will pick up Mixtral automatically once Ollama has it.
LM Studio users: watch for Bartowski’s or TheBloke’s GGUF quants to appear in the HF browser over the next day. Running into VRAM errors? MoE routing adds a new failure mode (partial expert loading); see our troubleshooting guide.
What comes next
Mistral is clearly building a family: 7B dense, 8x7B sparse, and (by implication) larger MoE variants on the roadmap. A Mixtral Instruct variant, tuned with DPO, has already shipped alongside the base weights. Community fine-tunes will appear within days now that MoE is supported in llama.cpp and the training tools are catching up. Expect coder variants, roleplay variants, and specialty multilingual tunes on Hugging Face within two weeks.
Bigger picture: Mixtral is the first production-quality open MoE. If it lives up to the release-day benchmarks, it’s going to push the whole open-weights ecosystem toward sparse architectures — because the active-parameter economics are that much better for inference at scale. For sovereign plebs, the headline is: 70B-class quality, 13B-class speed, Apache 2.0, on your disk tonight. Pull the weights, own the stack. See the Sovereign AI Manifesto and the pleb’s guide to self-hosted AI for the next steps.
Further reading: The same pleb-grade infrastructure that runs local inference also runs a Bitcoin space heater. Many readers arrive from the mining side — see From S19 to Your First AI Hashcenter for the bridge.
Benchmark history
Last benchmarked: December 11, 2023
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MATH | 28.4 | vendor_blog ✓ | December 11, 2023 |
| MT-Bench | 8.3 | vendor_blog ✓ | December 11, 2023 |
| HumanEval | 40.2 | vendor_blog ✓ | December 11, 2023 |
| MMLU | 70.6 | vendor_blog ✓ | December 11, 2023 |
Recommended hardware
Needs dual 3090 / 4090 for Q4, or a single 48 GB card (RTX A6000 / L40) for headroom.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
