
Superseded

Qwen 2.5

Alibaba · Qwen family · Released September 2024

Alibaba's September 2024 Qwen family spans 0.5B to 72B, plus coding and math specialists — mostly Apache 2.0.

Model card

Developer: Alibaba
Family: Qwen
License: Apache-2.0 (most sizes)
Modality: text
Parameters (B): 0.5, 1.5, 3, 7, 14, 32, 72
Context window: 128,000 tokens
Release date: September 2024
Primary languages: en, zh, ja, ko, fr, de, es, ar, ru, pt, it
Hugging Face: Qwen/Qwen2.5-7B-Instruct
Ollama: ollama pull qwen2.5

Qwen 2.5 drops: Alibaba releases 100+ open models in a single day

Alibaba Cloud just announced Qwen 2.5 at the Apsara Conference today, and the scale of the release is unlike anything we've seen in open weights. Over 100 models in a single drop: base and instruct variants across seven parameter sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B), specialized Qwen2.5-Coder and Qwen2.5-Math variants, and GGUF, AWQ, and GPTQ quantized versions of each. All on Hugging Face, all today. The Alibaba Cloud release blog and the Qwen org on Hugging Face have the full lineup.

The headline claim: Qwen 2.5 72B outperforms Llama 3.1 405B on several benchmarks with under a fifth of the parameters (both are dense models, so that ratio is the whole compute story). That's a bold pitch, given that Meta's 405B was the biggest open flagship anywhere when it dropped two months ago, but Qwen 2.5 has the receipts. Below: what's in the weights, what the benchmarks look like at launch, and what a 100-model open drop means for sovereign plebs building a local AI stack.

What’s in the weights

Qwen is Alibaba’s foundation model line, developed by the Qwen team at Alibaba Cloud Intelligence. The lineage: Transformer (2017) → Qwen 1 (September 2023) → Qwen 1.5 (February 2024) → Qwen 2 (June 2024) → Qwen 2.5 today. Architecturally, Qwen 2.5 is a refined decoder-only transformer — no MoE in the main chat line (though Alibaba has shipped MoE variants elsewhere). The refinements over Qwen 2 are in the training data and post-training pipeline, not the architecture.

Key specs across the family:

  • Seven base sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
  • Context window: 128K tokens across all sizes, with 8K output length
  • Training data: 18 trillion tokens (up from Qwen 2’s 7T) — significantly more than Llama 3’s 15T
  • Multilingual: 29+ languages, with strong Chinese, English, French, Spanish, Russian, Arabic, Japanese, Korean
  • License: Apache 2.0 for 0.5B, 1.5B, 7B, 14B, 32B; Qwen License (a custom, permissive-but-not-Apache license) for 3B and 72B
  • Attention: Grouped-Query Attention, RoPE, no sliding window on most sizes
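Grouped-Query Attention is the spec that matters most for local inference, since it is what keeps the KV cache small: many query heads share a smaller set of key/value heads. A minimal NumPy sketch of the mechanism (toy shapes, not Qwen's published dimensions):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_groups):
    """Grouped-Query Attention sketch: n_q_heads query heads share
    n_kv_groups key/value heads.

    q: (n_q_heads, seq, head_dim); k, v: (n_kv_groups, seq, head_dim).
    """
    n_q_heads, seq, d = q.shape
    repeat = n_q_heads // n_kv_groups
    # Each KV head serves `repeat` consecutive query heads.
    k = np.repeat(k, repeat, axis=0)
    v = np.repeat(v, repeat, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # (heads, seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax
    return weights @ v                                    # (heads, seq, head_dim)

# Toy shapes: 8 query heads sharing 2 KV heads (4:1 grouping).
rng = np.random.default_rng(0)
out = gqa_attention(rng.normal(size=(8, 5, 16)),
                    rng.normal(size=(2, 5, 16)),
                    rng.normal(size=(2, 5, 16)), n_kv_groups=2)
print(out.shape)  # (8, 5, 16)
```

Only the K and V tensors are cached during generation, so shrinking the KV head count shrinks the cache by the same ratio while leaving query-side capacity intact.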

Specialized variants

  • Qwen 2.5-Coder (1.5B, 7B, and 32B): trained on 5.5T tokens of source code, targeted at coding tasks. The 32B-Coder is the most capable open code model Alibaba has shipped.
  • Qwen 2.5-Math (1.5B, 7B, 72B): specialized math-reasoning fine-tunes with a Chain-of-Thought + Tool-Integrated Reasoning (TIR) pipeline.
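The TIR side of the Math variant is a generate-execute-feed-back loop: the model emits a code block, a runtime executes it, and the printed result goes back into the context. A stripped-down sketch of one round, with a hypothetical stand-in for the model call (the real pipeline's prompting and parsing details are Alibaba's, not reproduced here):

```python
import contextlib
import io
import re

def fake_math_model(prompt):
    # Stand-in for a Qwen2.5-Math call (hypothetical): the real model emits
    # chain-of-thought text plus a fenced python block for the runtime to run.
    return "I will compute 37 * 43 directly:\n```python\nprint(37 * 43)\n```"

def run_tir_step(prompt, model=fake_math_model):
    """One Tool-Integrated Reasoning round: generate, extract the code block,
    execute it, and return the tool output to append to the next prompt."""
    completion = model(prompt)
    match = re.search(r"```python\n(.*?)```", completion, re.S)
    if match is None:
        return completion, None            # pure chain-of-thought, no tool call
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # capture what the model's code prints
        exec(match.group(1), {})
    return completion, buf.getvalue().strip()

_, tool_output = run_tir_step("What is 37 * 43?")
print(tool_output)  # 1591
```

The point of TIR is that arithmetic gets delegated to an interpreter instead of being hallucinated token-by-token; the model's job is choosing what to compute.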

The 128K context window across the entire size ladder is notable. A 3B model with 128K context is an unusual offering — most sub-5B models cap at 8–32K context — and it makes Qwen 2.5 3B a credible candidate for on-device RAG workflows where the retrieved context can be long.
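A quick back-of-envelope shows why GQA makes 128K feasible on a small model: KV-cache memory scales with KV heads, not query heads. The layer and head counts below are illustrative assumptions in the rough shape of a ~3B GQA model, not the published Qwen 2.5 3B config:

```python
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per=2):
    """FP16 KV-cache size in GiB: K and V tensors (factor of 2) per layer."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per / 2**30

# Hypothetical ~3B-class config with aggressive GQA (2 KV heads) at full 128K.
print(round(kv_cache_gib(layers=36, kv_heads=2, head_dim=128, context=131072), 1))
```

With the assumed 2 KV heads the full-context cache lands around 4.5 GiB; a standard multi-head layout with 16 KV heads would be 8× that, which is why long context on sub-5B models was rare before aggressive GQA.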

Benchmarks at release

From Alibaba’s release blog, published today:

  • MMLU (5-shot): Qwen 2.5 72B at 86.1 vs Llama 3.1 70B at 82.0 and Llama 3.1 405B at 87.3 — 72B lands within a point of the 405B.
  • MMLU-Pro: Qwen 2.5 72B at 58.1 vs Llama 3.1 70B at 52.8, Llama 3.1 405B at 61.6.
  • MATH (0-shot CoT): Qwen 2.5 72B at 83.1 vs Llama 3.1 405B at 73.8 — Qwen 2.5 ahead by nearly 10 points on math.
  • HumanEval (code, 0-shot): Qwen 2.5 72B at 86.6 vs Llama 3.1 405B at 89.0 — Llama slightly ahead on this particular code benchmark.
  • MBPP (code): Qwen 2.5 72B at 88.2 vs Llama 3.1 405B at 87.8 — Qwen 2.5 very slightly ahead.
  • GSM8K: Qwen 2.5 72B at 95.8 vs Llama 3.1 405B at 96.8 — effectively tied.
  • Qwen 2.5-Coder 32B: claimed state-of-the-art among open code models on HumanEval, MBPP, and LiveCodeBench.
  • Chatbot Arena: preliminary LMSYS numbers put Qwen 2.5 72B-Instruct in the top tier of open models, competitive with Llama 3.1 405B-Instruct.

These are Alibaba-published numbers. Expect the Open LLM Leaderboard to rank the smaller Qwen 2.5 sizes over the next few days, and LMSYS Arena voting to settle the 72B's real standing over the next month.

Sovereign pleb implications

This release is a gift to every tier of local rig. The coverage from 0.5B (laptop / phone / Pi) to 72B (dual-3090 flagship) means there’s a Qwen 2.5 variant for every pleb setup. VRAM math at Q4_K_M:

  • Qwen 2.5 0.5B: ~400MB. Runs on a Raspberry Pi 5 or any phone.
  • Qwen 2.5 1.5B: ~1GB. Laptop-CPU or any small GPU.
  • Qwen 2.5 3B: ~2GB. Runs on a 4GB+ GPU with room for 128K context. Excellent Home Assistant / Obsidian integration candidate — see our guide.
  • Qwen 2.5 7B: ~4.5GB. Sweet spot for a single mid-range GPU.
  • Qwen 2.5 14B: ~9GB. Single 3060 / 4070 territory.
  • Qwen 2.5 32B: ~20GB. Single used 3090 (24GB) at Q4; the 32B-Coder is the interesting variant here for plebs who want a local coding assistant.
  • Qwen 2.5 72B: ~42GB. Dual 3090s (48GB). The home flagship tier.

See the GGUF quant guide for the Q4/Q5/Q6 tradeoffs at each size. On 72B, Q5_K_M is ~51GB — still workable on dual 3090s with tight context; Q4_K_M is the safer default.
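The per-size numbers above can be sanity-checked with a back-of-envelope formula: file size ≈ parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are rough averages for llama.cpp K-quants, not exact values (K-quants mix bit widths across tensor types):

```python
# Approximate average bits-per-weight for common llama.cpp quant formats.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.5, "FP16": 16.0}

def gguf_gb(params_b, quant):
    """Rough GGUF file size in decimal GB for a model with params_b billion weights."""
    return params_b * 1e9 * BPW[quant] / 8 / 1e9

for size in (0.5, 3, 7, 14, 32, 72):
    print(f"{size:>4}B Q4_K_M ~ {gguf_gb(size, 'Q4_K_M'):.1f} GB")
```

This tracks the ladder above (72B at Q4_K_M comes out near 44 GB before runtime overhead); remember to budget extra VRAM on top for the KV cache, which grows with context length.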

What this replaces in a daily stack:

  • Multilingual work: Qwen has been the multilingual open leader since Qwen 1.5; 2.5 widens the gap further. If you’re working in Chinese, Japanese, Korean, Arabic, or any non-English language as a daily driver, Qwen 2.5 at your VRAM tier is the upgrade.
  • Coding: Qwen 2.5-Coder 32B is a credible replacement for Copilot / ChatGPT on code tasks, at a model size that fits on a single 3090.
  • Math / reasoning: Qwen 2.5-Math is the first open model family to treat math as a first-class specialization with a tuned variant in three sizes.
  • General chat: for plebs running Llama 3.1 70B, Qwen 2.5 72B is a direct competitor worth head-to-head testing on your own workloads.

For Hashcenter operators, the seven-size ladder is the practical feature: you can deploy the same model family across laptop edge nodes, small GPU hosts, and dual-GPU flagship servers, with identical tokenizers and instruction formats. That’s a big operational simplification.
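Part of that simplification is the shared prompt format: every Qwen 2.5 size uses the same ChatML-style template, so prompt-building code written once works across the whole ladder. A sketch of the rendered prompt (template reproduced from memory; in practice let Ollama or the tokenizer's chat template apply it for you):

```python
def chatml(messages):
    """Render a message list in the ChatML-style format the Qwen family uses,
    ending with an open assistant turn for the model to complete."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```

Identical special tokens across sizes means you can A/B the 3B edge node against the 72B flagship on the exact same rendered prompts.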

How to run it today

Qwen 2.5 is on the Ollama registry at release:

ollama pull qwen2.5:0.5b
ollama pull qwen2.5:3b
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen2.5-coder:32b

New to Ollama? Our 10-minute install guide walks through setup. Pair with Open WebUI for a clean chat UI. For GUI loaders, LM Studio has Qwen 2.5 GGUFs via its Hugging Face browser. Official weights are on the Qwen HF org.
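Beyond the CLI, Ollama exposes a local HTTP API on port 11434, which is how you would wire Qwen 2.5 into scripts or a home automation stack. A minimal single-turn call using only the standard library (endpoint and payload shape per Ollama's chat API; the model tag is whichever you pulled):

```python
import json
from urllib import request

def build_chat_payload(prompt, model="qwen2.5:7b"):
    """Request body for Ollama's /api/chat endpoint (single turn, non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ollama_chat(prompt, model="qwen2.5:7b", host="http://localhost:11434"):
    """POST to a local Ollama daemon and return the assistant reply text."""
    req = request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# With the daemon running and a model pulled (e.g. `ollama pull qwen2.5:7b`):
# print(ollama_chat("Explain grouped-query attention in one sentence."))
```

Swapping `model` between tags is all it takes to move the same integration from a 3B edge node to the 72B flagship.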

What comes next

Alibaba teased Qwen-VL 2.5 (multimodal) and Qwen-Audio 2.5 for later releases. The Qwen team has been shipping a new major version roughly every 3–4 months, so expect a Qwen 3 cycle in early 2025. Community fine-tunes and merges will appear on Hugging Face within days — Qwen’s Apache 2.0 core sizes make that a low-friction path.

For sovereign plebs, the bigger picture: the open-weights race is no longer a US-only event. A Chinese model family just shipped 100+ weights on a single day and claimed parity with Meta’s 405B. That’s pressure on every frontier lab — closed or open — to keep shipping better weights faster. Pull the Qwen 2.5 variant that fits your rig, test it against whatever you’re running today, own the stack. See the Sovereign AI Manifesto for the case, and the pleb’s guide to self-hosted AI for the setup.

Further reading: The same pleb-grade infrastructure that runs local inference also runs a Bitcoin space heater. Many readers arrive from the mining side — see From S19 to Your First AI Hashcenter for the bridge.

Benchmark history

Last benchmarked: September 19, 2024 (needs refresh)

Benchmark | Score | Source      | Measured
MMLU-Pro  | 71.1  | vendor_blog | ✓ September 19, 2024
MT-Bench  | 9.35  | vendor_blog | ✓ September 19, 2024
MATH      | 83.1  | vendor_blog | ✓ September 19, 2024
GPQA      | 49    | vendor_blog | ✓ September 19, 2024
HumanEval | 86.6  | vendor_blog | ✓ September 19, 2024
MMLU      | 86.1  | vendor_blog | ✓ September 19, 2024

Recommended hardware

Needs dual 3090 / 4090 for Q4, or a single 48 GB card (A6000 / RTX 6000 Ada) for headroom.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.