Mistral Small 3
Mistral AI · Mistral family · Released January 2025
Mistral AI's January 2025 24B model — Apache 2.0, competitive with Llama 3.3 70B, fits on a single 24GB GPU.
Model card
| Developer | Mistral AI |
|---|---|
| Family | Mistral |
| License | Apache-2.0 |
| Modality | text |
| Parameters (B) | 24 |
| Context window | 32768 |
| Release date | January 2025 |
| Primary languages | en, fr, de, es, it, pt |
| Hugging Face | mistralai/Mistral-Small-24B-Instruct-2501 |
| Ollama | `ollama pull mistral-small` |
Mistral Small 3 drops: Apache-2.0 24B aimed at the single-GPU pleb
Mistral just released Mistral Small 3 — a 24-billion parameter dense transformer under the Apache 2.0 license, both base and instruct variants on Hugging Face today. The positioning is explicit in the release blog: match or exceed Llama 3.3 70B on most benchmarks, at roughly three times the inference speed, while fitting comfortably, once quantized, on a single 24GB GPU. The target customer is the sovereign pleb running local inference at home, not a cloud tenant burning someone else’s VRAM.
This is Mistral returning to its roots. After drifting toward Mistral Large under a proprietary license through 2024, Small 3 is a pointed reminder that Apache-2.0 is still the company’s signature release posture for the open line. The weights are on mistralai/Mistral-Small-24B-Instruct-2501 as of today. Below: what’s in the model, the benchmark snapshot, and the pleb VRAM math for running it on an actual home rig.
What’s in the weights
Mistral Small 3 is a 24B dense decoder-only transformer, built on the same Mistral architecture lineage that started with Mistral 7B (September 2023) and moved through Mixtral 8x7B (December 2023). It’s dense — not MoE — which is worth flagging because the rest of the 2024–2025 frontier pushed hard toward sparse mixtures. Mistral’s bet here is that a carefully trained dense 24B serves the single-GPU audience better than a sparse model whose total parameter count makes it awkward to host on one card.
Credit to the broader lineage: Transformer (Vaswani et al., 2017) → LLaMA and LLaMA 2 (Meta, 2023) → Mistral 7B’s Grouped-Query Attention and sliding-window attention work → Mistral Small 3 today. The architectural ideas that made Mistral 7B punch above its weight — GQA for cheap attention, a tight tokenizer, high-quality curated pretraining data — carry through at 24B scale.
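The GQA trick is simple enough to sketch: several query heads share each key/value head, which shrinks the KV cache (and its memory traffic) by the group factor. A minimal NumPy illustration — the head counts here are toy values for clarity, not Small 3's actual configuration:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-Query Attention: the number of query heads is a multiple of
    the number of KV heads, and each KV head serves a whole group of Q heads."""
    group = q.shape[0] // k.shape[0]
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)               # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # row-wise softmax
    return weights @ v                              # (n_q_heads, seq, d)

# Toy shapes: 8 query heads sharing 2 KV heads means a 4x smaller KV cache.
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 16, 64)
```

The KV cache only stores the 2 KV heads, not 8, which is exactly why GQA makes long contexts cheap on small cards.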
Key specs from the release:
- 24B parameters, dense — no MoE routing overhead, predictable per-token compute
- Context window: 32K tokens — not frontier-long, but plenty for most home workflows
- Vocabulary: 131K tokens — updated Tekken tokenizer, broader multilingual coverage than earlier Mistral models
- Grouped-Query Attention for efficient inference
- License: Apache 2.0 — fully permissive, commercial use unrestricted, no revenue thresholds, no usage-policy carve-outs
- Variants: Mistral-Small-24B-Base-2501 (base) and Mistral-Small-24B-Instruct-2501 (instruction-tuned, function-calling-ready)
- Training approach: Mistral does not publish token counts or training data details for this release, but the instruct variant was post-trained with a focus on instruction following and native function/tool calling
The instruct variant ships with native function calling out of the box. For pleb workflows that wire an LLM into tool use — home automation controllers, RAG stacks that hit local APIs, scripted multi-step agents — that’s a meaningful quality-of-life upgrade over models where tool use has to be coaxed via prompt scaffolding.
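In practice "native function calling" means the instruct model emits structured tool calls against the OpenAI-style `tools` schema that Mistral's API, vLLM, and Ollama all accept. A minimal sketch of the two halves — declaring a tool and parsing the model's call — with a hypothetical `set_thermostat` tool standing in for a real home-automation endpoint:

```python
import json

# Hypothetical tool definition in the OpenAI-style schema.
tools = [{
    "type": "function",
    "function": {
        "name": "set_thermostat",  # illustrative home-automation tool
        "description": "Set the target temperature in celsius.",
        "parameters": {
            "type": "object",
            "properties": {"celsius": {"type": "number"}},
            "required": ["celsius"],
        },
    },
}]

def extract_tool_call(message):
    """Pull (name, args) out of an assistant message that contains a tool call."""
    call = message["tool_calls"][0]["function"]
    args = call["arguments"]
    if isinstance(args, str):  # some servers return arguments as a JSON string
        args = json.loads(args)
    return call["name"], args

# Shape of a typical tool-call reply from the instruct model:
reply = {"role": "assistant", "tool_calls": [{"function": {
    "name": "set_thermostat", "arguments": '{"celsius": 21.5}'}}]}
name, args = extract_tool_call(reply)
print(name, args["celsius"])  # set_thermostat 21.5
```

You send `tools` alongside the chat messages, dispatch on the parsed name, then feed the tool's result back as a `tool` role message — no prompt scaffolding required.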
Benchmarks at release
Primary-source numbers from the Hugging Face model card for the instruct variant:
- MMLU-Pro: 66.3 — competitive with Llama 3.3 70B (68.9) and ahead of many larger open models
- HumanEval (code): 84.8 — solid coding performance for a 24B
- GPQA Diamond: 45.3 — mid-tier for graduate-level STEM reasoning, below Phi-4 but above many peers at this size
- MATH: 70.6 — strong math, within striking distance of larger models
- MT-Bench: 8.35 — strong instruction-following and multi-turn chat quality
Mistral’s release blog claims “over 81%” on MMLU without publishing an exact number, so the primary-source value to trust is the MMLU-Pro on the HF card. MMLU-Pro is the harder, more discriminating variant of MMLU introduced in mid-2024 — a score of 66.3 at 24B density is genuinely strong and supports Mistral’s headline claim that Small 3 competes with 70B-class models on reasoning tasks.
Where Small 3 is expected to lag: long-context work above 32K (Qwen 2.5 and Llama 3.3 push to 128K), and the absolute top of any benchmark where a 70B dense or a 400B MoE still holds an edge. The value proposition isn’t “beats everything” — it’s “competitive at a fraction of the VRAM.”
The Mistral release blog claims Small 3 is “more than 3x faster than Llama 3.3 70B” on the same hardware. That is a per-token-throughput claim, and it tracks with the parameter-count math: decoding is memory-bandwidth-bound, so a 24B dense model should stream tokens at roughly 70/24 ≈ 2.9x the rate of a 70B dense at equal precision on the same GPU. On a single 24GB card the practical gap is wider still, because Llama 3.3 70B at Q4 is about 40GB of weights, doesn’t fit in VRAM, and has to offload layers to CPU RAM: plebs running llama.cpp report Small 3 at Q5 hitting 40–50 tok/s on a 3090 while a partially offloaded 70B crawls along in the low single digits. That speed matters for interactive workloads where latency affects usability; it matters even more for batch workloads where you’re running many prompts through the model back-to-back.
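The bandwidth-bound reasoning above can be turned into a back-of-envelope throughput ceiling: every generated token streams the full weight set through the memory bus once, so tokens-per-second is roughly bandwidth divided by model size. A sketch, using approximate llama.cpp bits-per-weight averages (and note the 70B ceiling assumes its weights fit in VRAM, which they don't on a 24GB card):

```python
def decode_toks_per_sec(params_b, bits_per_weight, bandwidth_gb_s):
    """Rough single-stream decode ceiling: each token streams the whole
    weight set through the memory bus once, so throughput is
    bandwidth / model size. Ignores KV-cache traffic and kernel overhead."""
    model_gb = params_b * bits_per_weight / 8
    return bandwidth_gb_s / model_gb

BW_3090 = 936  # RTX 3090 memory bandwidth, GB/s

small3_q5 = decode_toks_per_sec(24, 5.5, BW_3090)    # Q5_K_M ~5.5 bits/weight
llama70_q4 = decode_toks_per_sec(70, 4.85, BW_3090)  # Q4_K_M ~4.85 bits/weight
print(f"Small 3 Q5 ceiling: {small3_q5:.0f} tok/s, 70B Q4 ceiling: {llama70_q4:.0f} tok/s")
```

The Small 3 ceiling lands in the mid-50s tok/s on a 3090, consistent with the 40–50 tok/s plebs actually observe once real-world overhead is paid.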
Sovereign pleb implications
This is the single-GPU model of 2025 for plebs who’ve built a respectable home rig but can’t justify a multi-card setup. The VRAM math:
- fp16: about 48GB — dual 24GB cards, a single 48GB A6000, or an 80GB A100/H100 with room to spare
- Q8: about 24GB — clean on a used RTX 3090, 4090, or any 24GB card
- Q5_K_M: about 16GB — comfortable on a 16GB 4080 or a 3090/4090 with headroom for context and tools
- Q4_K_M: about 14GB — sweet spot for 16GB cards; fits on a 12GB card with minor offload
See the GGUF quantization guide for the quality trade-offs. On a 24B dense model, Q5–Q6 is usually the best price-performance balance if your VRAM allows.
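The VRAM figures above are simple arithmetic: parameters times bits-per-weight, divided by eight. A sketch using approximate average bits-per-weight for common llama.cpp quant formats (the bit counts are rough community figures, not official specs):

```python
def weights_gb(params_b, bits_per_weight):
    """Weights-only footprint in GB; budget another 1-3 GB on top for the
    KV cache, activations, and runtime buffers."""
    return params_b * bits_per_weight / 8

# Approximate average bits/weight for common llama.cpp quant formats.
QUANTS = {"fp16": 16, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.85}
for name, bits in QUANTS.items():
    print(f"{name:7s} ~{weights_gb(24, bits):.1f} GB")
```

Run it for 24B and you recover the list above: ~48GB at fp16, mid-twenties at Q8, ~16.5GB at Q5_K_M, ~14.5GB at Q4_K_M, plus context headroom.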
What Small 3 replaces in a pleb stack:
- General assistant chat + tool use on a single 24GB card: Small 3 at Q8 is now the default. The combination of native function calling, Apache 2.0, and 3x the speed of Llama 3.3 70B makes it a practical daily driver.
- RAG backbones: Small 3’s 32K context and strong MMLU-Pro make it a reasonable RAG model, though Command R+ still wins on native grounded-citation formatting if you have the hardware.
- Code + STEM on one card: HumanEval 84.8 and MATH 70.6 put Small 3 firmly in “competent coding assistant” territory — not as strong as Qwen 2.5-Coder 32B for pure code, but versatile enough for mixed workloads.
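For the RAG use case, the practical constraint is the 32K window: retrieved chunks have to be packed under a token budget with room reserved for the system prompt, question, and reply. A minimal packing sketch, using the rough chars/4 ≈ tokens heuristic in place of a real tokenizer:

```python
def pack_context(chunks, budget_tokens=32_000, reserve=4_000):
    """Greedily pack retrieved chunks (assumed pre-sorted by retrieval
    score, best first) into the context window, reserving room for the
    system prompt, the question, and the model's reply."""
    est = lambda text: len(text) // 4  # crude chars/4 token estimate
    packed, used = [], 0
    for chunk in chunks:
        cost = est(chunk)
        if used + cost > budget_tokens - reserve:
            break  # stop before blowing the window
        packed.append(chunk)
        used += cost
    return packed, used

# Four 40k-char chunks (~10k tokens each): only two fit under 32K - 4K reserve.
docs = ["a" * 40_000, "b" * 40_000, "c" * 40_000, "d" * 40_000]
kept, used = pack_context(docs)
print(len(kept), used)  # 2 20000
```

Swapping in a real tokenizer (Tekken via `mistral-common`, or the GGUF's own vocab) tightens the estimate, but the budgeting logic stays the same.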
For inference-as-heater builds, a single 3090 or 4090 running Small 3 at Q8 under sustained load is a 350–450W heat source producing genuinely frontier-adjacent work. For plebs converting decommissioned mining sites into AI Hashcenters, Small 3’s density-per-card profile is ideal for packing more concurrent sessions onto fewer GPUs than a 70B requires.
How to run it today
Weights are on Hugging Face at mistralai/Mistral-Small-24B-Instruct-2501. Ollama registry entry is live:
ollama pull mistral-small
New to Ollama? The 10-minute Ollama install guide covers setup. For chat UI, Open WebUI pairs with Ollama cleanly. LM Studio loads GGUF quants directly — Bartowski’s quants of Mistral-Small-24B-Instruct-2501 are already on Hugging Face. For production deployment, vLLM and SGLang both support the model natively via the standard Mistral architecture path.
Hitting issues? The self-hosted AI troubleshooting guide covers the usual VRAM, quantization, and driver snags.
What comes next
Mistral’s 2024 pattern was to pair each open release with a larger closed sibling. Expect Mistral Large 3 or a similar premium tier in the coming months — priced behind the API, with Small 3 as the pleb-facing complement. Community fine-tunes will appear on Hugging Face quickly given the permissive license; watch for a Mistral-Small-24B-Coder variant and domain-specific instruct tunes within weeks.
Bigger picture: the 24B Apache-2.0 release at the end of January 2025 is a pointed statement in a landscape where many open labs have been drifting toward restrictive licensing. It’s one more layer of decentralization in the open-weight ecosystem — a frontier-adjacent model that any pleb can run, modify, and ship commercially with no asterisks. That matters. For the case, see the Sovereign AI for Bitcoiners Manifesto; for related retrospectives, Llama 3.3 sits in the same weight class on a larger card, Phi-4 is the STEM-specialist alternative at 14B, and DeepSeek V3 is what happens when you go the opposite direction on scale. For the setup, the pleb’s guide to self-hosted AI and Bitcoin space heater pages cover the hardware side.
Benchmark history
Last benchmarked: January 30, 2025
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MMLU-Pro | 66.3 | vendor_blog ✓ | January 30, 2025 |
| MT-Bench | 8.35 | vendor_blog ✓ | January 30, 2025 |
| MATH | 70.6 | vendor_blog ✓ | January 30, 2025 |
| GPQA | 45.3 | vendor_blog ✓ | January 30, 2025 |
| HumanEval | 84.8 | vendor_blog ✓ | January 30, 2025 |
Recommended hardware
Runs well on 24 GB VRAM (3090 / 4090) at Q4–Q5. A used 3090 is the pleb pick.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
