
Mistral Small 3

Mistral AI · Mistral family · Released January 2025

Mistral AI's January 2025 24B model — Apache 2.0, competitive with Llama 3.3 70B, fits on a single 24GB GPU once quantized.

Model card

Developer: Mistral AI
Family: Mistral
License: Apache-2.0
Modality: text
Parameters: 24B
Context window: 32,768 tokens
Release date: January 2025
Primary languages: en, fr, de, es, it, pt
Hugging Face: mistralai/Mistral-Small-24B-Instruct-2501
Ollama: ollama pull mistral-small

Mistral Small 3 drops: Apache-2.0 24B aimed at the single-GPU pleb

Mistral just released Mistral Small 3 — a 24-billion-parameter dense transformer under the Apache 2.0 license, with both base and instruct variants on Hugging Face today. The positioning is explicit in the release blog: match or exceed Llama 3.3 70B on most benchmarks, at roughly three times the inference speed, while fitting on a single 24GB GPU once quantized. The target customer is the sovereign pleb running local inference at home, not a cloud tenant burning someone else’s VRAM.

This is Mistral returning to its roots. After drifting toward Mistral Large under a proprietary license through 2024, Small 3 is a pointed reminder that Apache-2.0 is still the company’s signature release posture for the open line. The weights are on mistralai/Mistral-Small-24B-Instruct-2501 as of today. Below: what’s in the model, the benchmark snapshot, and the pleb VRAM math for running it on an actual home rig.

What’s in the weights

Mistral Small 3 is a 24B dense decoder-only transformer, built on the same Mistral architecture lineage that started with Mistral 7B (September 2023) and moved through Mixtral 8x7B (December 2023). It’s dense — not MoE — which is worth flagging because the rest of the 2024–2025 frontier pushed hard toward sparse mixtures. Mistral’s bet here is that a carefully trained dense 24B serves the single-GPU audience better than a sparse model whose total parameter count makes it awkward to host on one card.

Credit to the broader lineage: Transformer (Vaswani et al., 2017) → LLaMA and LLaMA 2 (Meta, 2023) → Mistral 7B’s Grouped-Query Attention and sliding-window attention work → Mistral Small 3 today. The architectural ideas that made Mistral 7B punch above its weight — GQA for cheap attention, a tight tokenizer, high-quality curated pretraining data — carry through at 24B scale.

Key specs from the release:

  • 24B parameters, dense — no MoE routing overhead, predictable per-token compute
  • Context window: 32K tokens — not frontier-long, but plenty for most home workflows
  • Vocabulary: 131K tokens — updated Tekken tokenizer, broader multilingual coverage than earlier Mistral models
  • Grouped-Query Attention for efficient inference
  • License: Apache 2.0 — fully permissive, commercial use unrestricted, no revenue thresholds, no usage-policy carve-outs
  • Variants: Mistral-Small-24B-Base-2501 (base) and Mistral-Small-24B-Instruct-2501 (instruction-tuned, function-calling-ready)
  • Training approach: Mistral does not publish token counts or training data details for this release, but the instruct variant was post-trained with a focus on instruction following and native function/tool calling

The instruct variant ships with native function calling out of the box. For pleb workflows that wire an LLM into tool use — home automation controllers, RAG stacks that hit local APIs, scripted multi-step agents — that’s a meaningful quality-of-life upgrade over models where tool use has to be coaxed via prompt scaffolding.
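A sketch of what native function calling buys you: the model emits a structured tool call, and your runtime only has to dispatch it. Everything below — `get_room_temp`, the tool registry, the simulated call — is hypothetical glue code, not Mistral's API; the schema shape follows the common OpenAI-style tool format that function-calling instruct models are trained on.

```python
import json

# Hypothetical local tool a pleb stack might expose to the model.
def get_room_temp(room: str) -> str:
    # Stub: a real rig would query Home Assistant or a local sensor API.
    return json.dumps({"room": room, "celsius": 21.5})

TOOLS = {"get_room_temp": get_room_temp}

# OpenAI-style JSON-schema spec you pass alongside the chat request.
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "get_room_temp",
        "description": "Read the current temperature of a room",
        "parameters": {
            "type": "object",
            "properties": {"room": {"type": "string"}},
            "required": ["room"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute one tool call emitted by the model and return its result."""
    fn = TOOLS[tool_call["name"]]
    args = tool_call["arguments"]
    if isinstance(args, str):  # some runtimes return arguments as a JSON string
        args = json.loads(args)
    return fn(**args)

# Simulated model output in the tool-call shape, dispatched locally.
result = dispatch({"name": "get_room_temp", "arguments": {"room": "office"}})
print(result)  # → {"room": "office", "celsius": 21.5}
```

With a model that emits this shape natively, the dispatcher is the whole integration; with a model that doesn't, you're regex-parsing prose and hoping.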

Benchmarks at release

Primary-source numbers from the Hugging Face model card for the instruct variant:

  • MMLU-Pro: 66.3 — competitive with Llama 3.3 70B (68.9) and ahead of many larger open models
  • HumanEval (code): 84.8 — solid coding performance for a 24B
  • GPQA: 45.3 — mid-tier for graduate-level STEM reasoning, below Phi-4 but above many peers at this size
  • MATH: 70.6 — strong math, within striking distance of larger models
  • MT-Bench: 8.35 — strong instruction-following and multi-turn chat quality

Mistral’s release blog claims “over 81%” on MMLU without publishing an exact number, so the primary-source value to trust is the MMLU-Pro on the HF card. MMLU-Pro is the harder, more discriminating variant of MMLU introduced in mid-2024 — a score of 66.3 at 24B density is genuinely strong and supports Mistral’s headline claim that Small 3 competes with 70B-class models on reasoning tasks.

Where Small 3 is expected to lag: long-context work above 32K (Qwen 2.5 and Llama 3.3 push to 128K), and the absolute top of any benchmark where a 70B dense or a 400B MoE still holds an edge. The value proposition isn’t “beats everything” — it’s “competitive at a fraction of the VRAM.”

The Mistral release blog claims Small 3 is “more than 3x faster than Llama 3.3 70B” on the same hardware, which is a per-token-throughput claim that tracks with the parameter-count math: a 24B dense model should run at roughly 2.9x the tokens-per-second of a 70B dense at equal precision on the same GPU. On a 24GB card the practical gap is even larger: Small 3 at Q5 runs fully in VRAM (40–50 tok/s territory on a 3090), while Llama 3.3 70B at Q4 doesn’t fit at all and has to spill layers to system RAM, which costs most of its throughput. That speed matters for interactive workloads where latency affects usability; it matters even more for batch workloads where you’re running many prompts through the model back-to-back.
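The parameter-count math is simple enough to sanity-check in two lines: per-token compute (and weight-streaming bandwidth) for a dense decoder scales roughly linearly with parameter count, so the theoretical throughput edge is just the ratio of the two sizes.

```python
def dense_speed_ratio(params_small_b: float, params_big_b: float) -> float:
    """Theoretical per-token throughput ratio between two dense decoders.

    Assumes equal precision on the same GPU and memory-bandwidth-bound
    decoding, where compute per token scales linearly with parameter count.
    """
    return params_big_b / params_small_b

ratio = dense_speed_ratio(24, 70)
print(f"{ratio:.2f}x")  # → 2.92x
```

Real-world ratios drift with quantization level, batch size, and KV-cache pressure, but the linear model is a decent first approximation for dense-vs-dense comparisons.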

Sovereign pleb implications

This is the single-GPU model of 2025 for plebs who’ve built a respectable home rig but can’t justify a multi-card setup. The VRAM math:

  • fp16: about 48GB — dual 24GB cards or a single 48GB workstation card (A6000-class)
  • Q8: about 25GB — slightly over a 24GB card's budget once the KV cache is counted; Q6_K (~20GB) is the clean fit on a used RTX 3090 or 4090
  • Q5_K_M: about 17GB — roomy on a 3090/4090; a 16GB 4080 needs a small offload
  • Q4_K_M: about 14GB — the sweet spot for 16GB cards; runs on a 12GB card with some layers offloaded

See the GGUF quantization guide for the quality trade-offs. On a 24B dense model, Q5–Q6 is usually the best price-performance balance if your VRAM allows.
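The figures above come from a back-of-envelope rule: weight footprint ≈ parameters × bits-per-weight ÷ 8. The bits-per-weight values below are approximate averages (GGUF K-quants mix precisions internally), so treat the output as a rule of thumb; real files plus KV cache land somewhat higher.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Rule-of-thumb weight footprint in GB: params (billions) x bits / 8."""
    return params_b * bits_per_weight / 8

# Approximate effective bits-per-weight for common GGUF quants (assumed
# averages, not exact file sizes).
QUANTS = {"fp16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

for name, bits in QUANTS.items():
    # KV cache and runtime overhead come on top of this figure.
    print(f"{name:7s} ~{weights_gb(24, bits):.1f} GB")
```

Swap in another model's parameter count to budget any dense release the same way.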

What Small 3 replaces in a pleb stack:

  • General assistant chat + tool use on a single 24GB card: Small 3 at Q6 is now the default. The combination of native function calling, Apache 2.0, and roughly 3x the speed of Llama 3.3 70B makes it a practical daily driver.
  • RAG backbones: Small 3’s 32K context and strong MMLU-Pro make it a reasonable RAG model, though Command R+ still wins on native grounded-citation formatting if you have the hardware.
  • Code + STEM on one card: HumanEval 84.8 and MATH 70.6 put Small 3 firmly in “competent coding assistant” territory — not as strong as Qwen 2.5-Coder 32B for pure code, but versatile enough for mixed workloads.
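For RAG sizing against the 32K window, the budgeting is plain arithmetic: reserve tokens for the system prompt and the model's answer, then divide what's left by your chunk size. The numbers below are illustrative assumptions, not recommendations.

```python
def chunks_that_fit(context_window: int, system_tokens: int,
                    answer_reserve: int, chunk_tokens: int) -> int:
    """How many retrieved chunks fit in the prompt after reserving room
    for the system prompt and the model's answer."""
    return (context_window - system_tokens - answer_reserve) // chunk_tokens

# Hypothetical budget for Small 3's 32K window: a 768-token system prompt,
# 2,000 tokens reserved for the answer, 512-token retrieval chunks.
print(chunks_that_fit(32768, 768, 2000, 512))  # → 58
```

Fifty-plus chunks is far more than most retrievers should return anyway, which is why 32K is "plenty for most home workflows" even without a 128K window.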

For inference-as-heater builds, a single 3090 or 4090 running Small 3 at a high quant under sustained load is a 350–450W heat source producing genuinely frontier-adjacent work. For plebs converting decommissioned mining sites into AI Hashcenters, Small 3’s density-per-card profile is ideal for packing more concurrent sessions onto fewer GPUs than a 70B requires.

How to run it today

Weights are on Hugging Face at mistralai/Mistral-Small-24B-Instruct-2501. Ollama registry entry is live:

ollama pull mistral-small

New to Ollama? The 10-minute Ollama install guide covers setup. For chat UI, Open WebUI pairs with Ollama cleanly. LM Studio loads GGUF quants directly — Bartowski’s quants of Mistral-Small-24B-Instruct-2501 are already on Hugging Face. For production deployment, vLLM and SGLang both support the model natively via the standard Mistral architecture path.
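Once `ollama pull mistral-small` finishes, you can script against it with nothing but the standard library via Ollama's /api/chat endpoint (the endpoint, port, and response shape are Ollama's documented defaults; the prompt is just an example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_request(prompt: str, model: str = "mistral-small") -> dict:
    """Payload for Ollama's /api/chat endpoint, non-streaming."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

if __name__ == "__main__":
    print(chat("Summarize the Apache-2.0 license in one sentence."))
```

The same payload shape works against vLLM's OpenAI-compatible server with the URL and model name swapped, so scripts written against the home rig port to a production box unchanged in structure.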

Hitting issues? The self-hosted AI troubleshooting guide covers the usual VRAM, quantization, and driver snags.

What comes next

Mistral’s 2024 pattern was to pair each open release with a larger closed sibling. Expect Mistral Large 3 or a similar premium tier in the coming months — priced behind the API, with Small 3 as the pleb-facing complement. Community fine-tunes will appear on Hugging Face quickly given the permissive license; watch for a Mistral-Small-24B-Coder variant and domain-specific instruct tunes within weeks.

Bigger picture: the 24B Apache-2.0 release at the end of January 2025 is a pointed statement in a landscape where many open labs have been drifting toward restrictive licensing. It’s one more layer of decentralization in the open-weight ecosystem — a frontier-adjacent model that any pleb can run, modify, and ship commercially with no asterisks. That matters. For the case, see the Sovereign AI for Bitcoiners Manifesto; for related retrospectives, Llama 3.3 sits in the same weight class on a larger card, Phi-4 is the STEM-specialist alternative at 14B, and DeepSeek V3 is what happens when you go the opposite direction on scale. For the setup, the pleb’s guide to self-hosted AI and Bitcoin space heater pages cover the hardware side.

Benchmark history

Last benchmarked: January 30, 2025

Benchmark | Score | Source | Measured
MMLU-Pro | 66.3 | vendor_blog | January 30, 2025
MT-Bench | 8.35 | vendor_blog | January 30, 2025
MATH | 70.6 | vendor_blog | January 30, 2025
GPQA | 45.3 | vendor_blog | January 30, 2025
HumanEval | 84.8 | vendor_blog | January 30, 2025

Recommended hardware

Runs well on 24 GB VRAM (3090 / 4090) at Q4–Q5. A used 3090 is the pleb pick.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.