Mistral 7B
Mistral AI · Mistral family · Released September 2023
Mistral AI's September 2023 debut — a 7B Apache-2.0 model that popularized Grouped-Query and Sliding Window Attention.
Model card
| Developer | Mistral AI |
|---|---|
| Family | Mistral |
| License | Apache-2.0 |
| Modality | text |
| Parameters (B) | 7 |
| Context window | 32768 |
| Release date | September 2023 |
| Primary languages | en,fr |
| Hugging Face | mistralai/Mistral-7B-Instruct-v0.3 |
| Ollama | ollama pull mistral |
Mistral 7B: a small European lab just shipped an open model that beats Llama 2 13B
A Paris startup most people hadn’t heard of six months ago just released Mistral 7B — and it outperforms Meta’s Llama 2 13B on every benchmark Mistral tested, while being roughly half the size. The model landed today under an Apache 2.0 license, which is about as permissive as it gets: commercial use, modification, redistribution, no gating. The official announcement went up this morning, with weights on Hugging Face and a reference inference implementation on GitHub.
For context: Mistral AI was founded earlier this year by former Meta and DeepMind researchers and raised a record-setting $118M seed round in June. Mistral 7B is their first public model — and they’re stepping into a space that’s been dominated by Meta (Llama 2, July 2023) and OpenAI (closed GPT-3.5/4) with a clear capability-per-parameter pitch. Today’s release validates that pitch with numbers. Below: what’s in the model, what the benchmarks say, and what this means for plebs who’ve been running Llama 2 7B or 13B on home rigs and wondering if there’s a better daily driver.
What’s in the weights
Mistral 7B is a 7.3-billion parameter decoder-only transformer. The architectural lineage is the familiar one: Transformer (Vaswani et al., 2017) → GPT-style decoder stacks → LLaMA 1 (Meta, February 2023) → Llama 2 (Meta, July 2023) → Mistral 7B today. The Mistral team took the Llama architecture as a starting point and made two specific changes that matter.
Grouped-Query Attention (GQA). Instead of full multi-head attention, Mistral 7B groups attention heads so several query heads share a single key/value head (32 query heads over 8 KV heads). This cuts the KV cache size by roughly 4x, which means faster inference and dramatically less VRAM pressure at long context. GQA was introduced by Ainslie et al. (2023) and first shipped at scale in Llama 2 70B; Mistral brings it down to the 7B tier.
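A back-of-the-envelope sketch of what that saving looks like, using the commonly cited Mistral 7B attention config (32 layers, 32 query heads, 8 KV heads, head dim 128, fp16 cache); treat those numbers as assumptions, not a spec sheet:

```python
# Back-of-the-envelope KV-cache sizing: full multi-head attention vs GQA.
# Config values are the commonly cited Mistral 7B settings (assumptions).

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (fp16 = 2 bytes)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V

seq_len = 8192
mha = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)  # no grouping
gqa = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)   # 4 query heads per KV head

print(f"MHA KV cache @ 8K ctx: {mha / 2**30:.1f} GiB")  # ~4.0 GiB
print(f"GQA KV cache @ 8K ctx: {gqa / 2**30:.1f} GiB")  # ~1.0 GiB, the ~4x saving
```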
Sliding Window Attention (SWA). Each token attends only to the previous 4,096 tokens, not to every token in the context. Attention cost then grows linearly with sequence length rather than quadratically, and per-token memory stays flat, which lets the model handle much longer context at a fraction of the memory cost. Information still reaches beyond the window: each stacked layer extends the effective span by another 4,096 tokens, so the upper layers can draw on context well past 16K even though any single layer only sees 4K back.
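To make the windowing concrete, here is a toy sliding-window causal mask in plain NumPy; the 4-token window stands in for Mistral's 4,096 and is purely illustrative:

```python
# Toy sliding-window causal mask (window of 4 stands in for Mistral's 4,096).
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # no peeking at the future
    windowed = j > i - window        # only the last `window` positions
    return causal & windowed

print(sliding_window_mask(6, window=4).astype(int))
# Each row has at most 4 ones: per-token attention work and memory stay
# constant no matter how long the sequence grows.
```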
Key specs:
- 7.3B parameters, decoder-only transformer
- Sliding window: 4,096 tokens
- Nominal context: 8,192 tokens with longer effective reach via SWA
- Apache 2.0 license — fully permissive, commercial OK
- Tokenizer: SentencePiece BPE, 32K vocabulary
- Training data and compute: not disclosed in the release blog
The Apache 2.0 license is the second headline of the day. Llama 2’s “Community License” isn’t OSI-approved open source and has restrictions (the 700M monthly active user clause, acceptable use policies). Mistral 7B has none of that. It’s the most permissive flagship-tier open model released to date.
Benchmarks at release
From Mistral’s release blog:
- MMLU (5-shot): Mistral 7B at 60.1 vs Llama 2 7B at 44.4 and Llama 2 13B at 55.6 — Mistral 7B beats the 13B by 4.5 points at roughly half the size.
- HellaSwag (commonsense reasoning): Mistral 7B at 81.3 vs Llama 2 13B at 80.7.
- WinoGrande: Mistral 7B at 75.3 vs Llama 2 13B at 72.9.
- HumanEval (code, pass@1): Mistral 7B at 30.5 vs Llama 2 13B at 18.3 — a large gap on code.
- MATH: Mistral 7B at 13.1 vs Llama 2 13B at 6.0 — another large gap on math.
- GSM8K: Mistral 7B at 52.2 vs Llama 2 13B at 28.7 — the biggest reasoning delta.
Mistral also benchmarked against Llama 1 34B and showed Mistral 7B ahead on reasoning, comprehension, and code, while trailing only slightly on broad knowledge. That’s a 7B model beating a 34B model from one generation back. These are creator-published numbers — the Open LLM Leaderboard will have independent numbers in a few days.
Sovereign pleb implications
This is the “finally, a home model that’s actually good” release. Here’s the practical VRAM math for plebs running local inference:
- Mistral 7B at fp16: about 15GB. Fits on a single used RTX 3090 (24GB) or a 16GB 4060 Ti with minor offload.
- Mistral 7B at Q8: about 8GB. Fits on a 12GB 3060 or a 16GB card with plenty of room for long context.
- Mistral 7B at Q4_K_M (GGUF): about 4.2GB. Runs on almost any GPU with 6GB+, and runs well on CPU-only with 16GB system RAM.
For GGUF quantization choices, see our guide — on a 7B, Q4_K_M is a very reasonable daily driver; Q5_K_M is noticeably sharper if you can spare the VRAM. Mistral 7B’s smaller footprint (vs Llama 2 13B) means Q5 on a 12GB card is easy.
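For the curious, the VRAM figures above fall out of simple arithmetic. A rough sizing helper, using ballpark bits-per-weight for the common GGUF formats (approximations, not exact format specs; real files carry a little overhead and the KV cache comes on top):

```python
# Rough weight-footprint estimates behind the numbers above.
# Bits-per-weight are ballpark averages for common GGUF formats (assumptions).

PARAMS = 7.3e9  # Mistral 7B parameter count

formats = {
    "fp16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.8,
}

for name, bits_per_weight in formats.items():
    gib = PARAMS * bits_per_weight / 8 / 2**30
    print(f"{name:7s} ~{gib:4.1f} GiB of weights")
```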
What this replaces in a pleb stack: if you were running Llama 2 7B for chat, swap to Mistral 7B — same hardware, meaningfully better quality. If you were running Llama 2 13B on a single 3090 (Q4) or dual cards (higher precision), Mistral 7B at Q5 or Q8 gives you better quality at lower VRAM cost. For plebs who had been CPU-only because their GPU couldn’t hold 13B at Q4, Mistral 7B at Q4 is easily within reach on a mid-tier card — and the tokens/sec jump from CPU to GPU is the difference between “usable for light chat” and “real daily driver.”
For the inference-as-heater crowd, a single GPU pushing Mistral 7B at sustained load dissipates about 250–350W depending on card — a fine room-heater profile, and now with a model good enough to be doing genuinely useful work rather than just pattern-matching. The economics of a mining-rig-to-AI-node conversion shift modestly: smaller models mean you can pack more concurrent users onto decommissioned hardware.
The Apache 2.0 license matters for plebs who want to fine-tune and redistribute. You can train a specialty Mistral 7B variant, publish it, sell it, wrap it in a product — with no gating clause to read. That’s going to unleash a wave of derivative models in the coming weeks.
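As one common starting point for those derivatives, here is a minimal sketch of attaching LoRA adapters with the peft library; the target module names and hyperparameters are illustrative assumptions, not a recipe from Mistral:

```python
# Minimal sketch: attach LoRA adapters to Mistral 7B with the peft library.
# Module names and hyperparameters are illustrative assumptions.
# Requires `pip install transformers peft accelerate torch`.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)

lora = LoraConfig(
    r=16,                      # adapter rank: small = cheap, larger = more capacity
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a fraction of the 7.3B weights train

# From here, any standard training loop on your own data produces an adapter
# you can publish, sell, or merge back into the base weights under Apache 2.0.
```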
How to run it today
Mistral 7B is available today via:
- Hugging Face: the official mistralai/Mistral-7B-v0.1 (base) and mistralai/Mistral-7B-Instruct-v0.1 repos have fp16 weights.
- GitHub: Mistral published a minimal reference inference implementation alongside the weights.
- llama.cpp and Ollama: the community is already converting and publishing GGUF quantizations. Expect an `ollama pull mistral` registry entry within a day or two, if not already live by evening.
Once it’s on Ollama, the pull command will be straightforward. If you haven’t set up Ollama yet, our 10-minute install guide gets you from zero to running. For a clean chat UI, pair with Open WebUI. Building on your own stack? LM Studio, Ollama, or llama.cpp will all be good targets once GGUF quants are live.
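If you'd rather skip the GGUF route and pull the fp16 weights straight from Hugging Face, a minimal transformers sketch looks like the following; it assumes a recent transformers release with Mistral support, the accelerate package for device placement, and roughly 16 GB of VRAM (or patience with CPU offload):

```python
# Minimal sketch: run the base model from the Hugging Face repo with transformers.
# Assumes `pip install transformers accelerate torch` and a recent transformers
# version with Mistral support; not the only way to run it, just the most direct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights, ~15 GB
    device_map="auto",          # spills to CPU if the GPU runs short on VRAM
)

prompt = "Sliding window attention works by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```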
What comes next
Mistral’s release blog is explicit that a larger model is coming — “stay tuned for larger models.” They didn’t give a size or a date, but with the base and instruct variants already published, the obvious next step is a bigger base model in the months ahead. Expect the community to publish fine-tunes (dolphin, openhermes, zephyr-class) within two weeks; the Apache 2.0 license removes the friction that slowed Llama 2 derivatives.
For sovereign plebs, this release is straightforward good news: a small, fast, permissively licensed model that beats previous-generation giants. Own the hardware, pull the weights, close the API tab. See our Sovereign AI Manifesto for the broader case, and the pleb’s guide to self-hosted AI for the next step. Mistral just showed that a small European lab can ship frontier-class open weights on Apache 2.0 in its first release. That’s a healthy sign for the whole sovereign-AI ecosystem.
Benchmark history
Last benchmarked: September 27, 2023
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MATH | 13.1 | vendor_blog ✓ | September 27, 2023 |
| MT-Bench | 6.84 | vendor_blog ✓ | September 27, 2023 |
| HumanEval | 30.5 | vendor_blog ✓ | September 27, 2023 |
| MMLU | 60.1 | vendor_blog ✓ | September 27, 2023 |
Recommended hardware
Runs quantized in 8–12 GB of VRAM — RTX 3060 / 4060 / M2 territory. Sweet spot for home rigs.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
