Qwen 3
Alibaba · Qwen family · Released May 2025
Alibaba's May 2025 release — first open family with hybrid reasoning (toggleable chain of thought), Apache 2.0 across all sizes.
Model card
| Developer | Alibaba |
|---|---|
| Family | Qwen |
| License | Apache-2.0 |
| Modality | text |
| Parameters (B) | 0.6, 1.7, 4, 8, 14, 32, 30 (MoE), 235 (MoE) |
| Context window | 131072 |
| Release date | May 2025 |
| Primary languages | en, zh, ja, ko, fr, de, es, ar, ru, pt, it |
| Hugging Face | Qwen/Qwen3-8B |
| Ollama | ollama pull qwen3 |
Alibaba’s Qwen team just released Qwen3, and the open-weight landscape keeps getting more crowded in the best possible way. Eight models ship today across two architectural families: six dense models from 0.6B to 32B parameters, and two mixture-of-experts models—Qwen3-30B-A3B (30B total, 3B active) and Qwen3-235B-A22B (235B total, 22B active). Every model supports a hybrid reasoning mode that plebs can toggle per query, a context window of up to 128K tokens (32K on the three smallest), and multilingual capability across 119 languages.
This is the release that makes Qwen a first-class open-weight citizen alongside Llama and DeepSeek. For sovereign plebs, the real stars are Qwen3-30B-A3B (runs fast on consumer hardware thanks to the MoE architecture) and Qwen3-32B (the dense flagship for single-GPU rigs). The Apache 2.0 license on everything means no commercial-use gotchas. Pull it, run it, use it commercially—no license drama, no "acceptable use" corporate weasel clauses.
What’s in the weights
Qwen’s research lineage runs through a distinct series of releases since 2023: Qwen 1 (August 2023, 7B/14B), Qwen 1.5 (February 2024, more sizes), Qwen 2 (June 2024, added MoE variant), Qwen 2.5 (September 2024, the release that established Qwen as a serious pleb option with strong code and math performance), and now Qwen3. Each generation has meaningfully improved reasoning and multilingual capability. The Qwen3 release blog describes this as the "biggest jump" in the family’s history, which is corporate puff but plausibly accurate given what’s in the weights.
The eight models:
- Qwen3-0.6B: Phone-class. 32K context. Dense.
- Qwen3-1.7B: Small laptop / Raspberry Pi tier. 32K context.
- Qwen3-4B: Entry-level GPU tier. 32K context.
- Qwen3-8B: Single-GPU daily driver. 128K context.
- Qwen3-14B: 24GB card sweet spot. 128K context.
- Qwen3-32B: Dense flagship. 128K context. Fits on a 3090 at Q4.
- Qwen3-30B-A3B: MoE, 30B total / 3B active, 128K context. Very interesting for plebs.
- Qwen3-235B-A22B: Frontier MoE, 235B total / 22B active, 128K context. Enterprise or well-funded pleb only.
The big architectural story is the hybrid thinking mode. Every Qwen3 model ships with a toggleable "thinking" behavior—enable it, and the model produces a `<think>...</think>` chain-of-thought before its final answer, similar to how DeepSeek R1 works. Disable it, and the model responds directly with no reasoning trace. Plebs control this via prompt (append /think or /no_think) or via generation-time parameters. This is a cleaner UX than requiring separate "thinking" and "instant" models—one set of weights, two behaviors, pleb’s choice per query.
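Any tooling built around thinking mode needs to separate the reasoning trace from the final answer before display or logging. A minimal sketch in Python (the `split_thinking` helper here is ours, not part of any Qwen SDK), assuming at most one think block per response:

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a Qwen3-style response into (reasoning_trace, final_answer).

    Assumes at most one <think>...</think> block; in /no_think mode
    the block is typically empty or absent, and the trace comes back "".
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    trace = match.group(1).strip()
    answer = raw[match.end():].strip()
    return trace, answer

# Thinking-mode style response: trace and answer come apart cleanly
trace, answer = split_thinking("<think>2+2 is 4.</think>\nThe answer is 4.")
```

The same helper lets a UI hide or collapse the trace while logging it for later inspection.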
Under the hood, Qwen3 dense models use the now-standard decoder Transformer with Grouped Query Attention, RoPE, and SwiGLU. The MoE variants use 128 experts for the 30B-A3B (8 active per token) and similar expert routing for the 235B-A22B. Training data: Alibaba reports 36 trillion tokens of pretraining data, roughly double Qwen 2.5’s 18T budget, with heavy emphasis on code, math, and multilingual content. The Qwen3 GitHub has the technical details including the multi-stage training pipeline: standard pretraining, then reasoning-focused RL for the thinking mode, then a final blend stage that preserves both modes in the same weights.
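A toy illustration of the top-k gating idea behind the MoE variants: a gating layer scores every expert, the top 8 of 128 are selected per token, and only those experts run, with softmax-renormalized mixture weights. This is a from-scratch NumPy sketch of the general technique, not Qwen's actual router (real routers add load-balancing losses and other details not shown):

```python
import numpy as np

def moe_route(hidden: np.ndarray, gate_w: np.ndarray, k: int = 8) -> np.ndarray:
    """Toy top-k expert gating: score all experts, keep the top k,
    renormalize their scores with a softmax. Returns a sparse vector
    of mixture weights, nonzero only for the k selected experts."""
    logits = hidden @ gate_w                   # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected k
    out = np.zeros(gate_w.shape[1])
    out[top] = weights                         # 8 of 128 entries are nonzero
    return out

rng = np.random.default_rng(0)
# 64-dim hidden state routed over 128 experts, 8 active
mix = moe_route(rng.standard_normal(64), rng.standard_normal((64, 128)))
```

The sparsity is the whole point: per token, only the selected experts' FFN weights do any matmul work.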
Benchmarks at release
Per Alibaba’s release blog and technical report, Qwen3 performance on public benchmarks at release:
- Qwen3-235B-A22B (thinking): AIME 2024 at 85.7, LiveCodeBench at 70.7, GPQA at 71.1—competitive with DeepSeek R1 and o1
- Qwen3-32B (thinking): AIME 2024 at 81.4, strong math and code performance for a dense 32B
- Qwen3-30B-A3B (thinking): AIME 2024 at 80.4—a 3B-active-parameter model scoring near the 32B dense, at far lower inference cost
- Qwen3-8B (thinking): AIME 2024 at 76.0, MMLU-Redux at 77.4
- MMLU: 235B-A22B at 87.8, 32B at 83.1, 30B-A3B at 82.8
- Multilingual benchmarks: Qwen3 claims meaningful capability across 119 languages, a sharp expansion from Qwen 2.5’s 29-language claim
The standout number is Qwen3-30B-A3B’s performance relative to its inference cost. A 3B-active-parameter MoE performing comparably to dense 32B models means plebs can run the 30B-A3B at near-3B inference speeds with near-32B capability, assuming your hardware can hold the full 30B parameter set in VRAM or unified memory. Community reproductions on the Open LLM Leaderboard and LMSys Arena will refine these numbers over the next few weeks—treat Alibaba’s self-reported benchmarks as directional until independent evaluators confirm.
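The back-of-envelope math behind that claim uses the common rule of thumb that one decode step costs roughly 2 FLOPs per active parameter (memory bandwidth complicates the picture in practice, since the full 30B weight set must still be resident):

```python
def decode_flops_per_token(active_params_b: float) -> float:
    """Rough rule of thumb: a decode step costs about 2 FLOPs per
    active parameter (one multiply plus one add per weight)."""
    return 2 * active_params_b * 1e9

dense_32b = decode_flops_per_token(32)   # Qwen3-32B: all 32B params active
moe_30b = decode_flops_per_token(3)      # Qwen3-30B-A3B: only 3B active
speedup = dense_32b / moe_30b            # compute ratio per generated token
```

The ratio works out to roughly 10x less compute per token, which is where the "near-3B inference speeds" framing comes from.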
What it means for the sovereign pleb
The sovereign AI manifesto argues plebs should own their inference stack end-to-end. Qwen3 adds two compelling options to the pleb arsenal. The dense 32B is a straightforward upgrade to Qwen 2.5 32B for single-GPU rigs. The MoE 30B-A3B is more interesting: it fits in similar VRAM to a dense 32B but runs inference at roughly 3B-parameter speed. For plebs generating large volumes of text (agent loops, long-form drafts, batch processing), that speed advantage is material.
VRAM requirements at Q4_K_M:
- Qwen3-4B: ~3GB — any GPU with 8GB+ or M-series Mac
- Qwen3-8B: ~5GB — RTX 3060 12GB, low-end gaming laptops
- Qwen3-14B: ~9GB — fits an RTX 3060 12GB with a little headroom, RTX 4070, Mac with 16GB+
- Qwen3-32B: ~20GB — single RTX 3090/4090, or Mac with 32GB+ unified memory
- Qwen3-30B-A3B: ~20GB — same VRAM as 32B dense but ~10× faster inference thanks to MoE sparsity
- Qwen3-235B-A22B: ~140GB — not a home-pleb model, needs an enterprise rig
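These estimates follow from simple bits-per-weight arithmetic. A rough sketch (the ~4.85 bits/weight average for Q4_K_M and ~8.5 for Q8_0 are approximations; real GGUF files vary by a few percent, and KV cache plus runtime buffers add 1-2GB or more on top at load time):

```python
def gguf_weight_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    """Approximate GGUF weight footprint: parameter count times the
    quant's average bits per weight (Q4_K_M averages ~4.85 bits).
    KV cache and runtime buffers are NOT included."""
    return round(params_b * 1e9 * bits_per_weight / 8 / 1e9, 1)

q4_32b = gguf_weight_gb(32)                        # weights for the dense flagship
q8_32b = gguf_weight_gb(32, bits_per_weight=8.5)   # Q8_0 pushes past a 24GB card
```

This is also the quick sanity check for "does Q8 fit": at ~8.5 bits/weight a 32B model's weights alone blow past 24GB, which is why Q4_K_M is the 3090/4090 default.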
For the used RTX 3090 pleb rig, Qwen3-30B-A3B is the new default for speed-sensitive workloads (agents, tool-use loops, long-generation tasks). Qwen3-32B dense is the default for quality-sensitive workloads where raw benchmark performance matters more than tokens-per-second. Having both on the same 24GB card (one at a time) means plebs can pick the right tool per job. Quant selection follows our GGUF explainer—Q4_K_M for VRAM-constrained, Q8 when you have the room.
The hybrid thinking mode is the other pleb win. Want a quick response? Append /no_think. Want the model to reason hard? /think. This means a single deployed Qwen3 serves both the "quick chat" and "hard problem" roles that previously required separate model deployments. For an Open WebUI setup, this simplifies the model-picker logic dramatically—one Qwen3, two modes, routed by user prompt convention or UI toggle.
Apache 2.0 licensing across the entire Qwen3 family is the unspoken headline. No license acceptance on Hugging Face. No "acceptable use" clauses. No research-only restrictions. Plebs running commercial Hashcenter workflows—selling inference, embedding models in products, building agent services—can use Qwen3 freely. Contrast with Llama’s 700M MAU limitation or Gemma’s slightly more restrictive community license: Qwen3 is the most commercially permissive frontier-class open model on the market today.
For S19-to-AI-Hashcenter conversions, Qwen3-30B-A3B is an attractive workload candidate. MoE inference has lower per-token compute demand than dense, which means either better tokens-per-second per GPU or more concurrent users on the same hardware. The heating-with-inference economics still work—the compute is real even if it’s sparser—and the commercial-use license makes paid inference services straightforwardly legal.
How to run it today
Quickstart via Ollama:
```shell
ollama pull qwen3:8b
ollama pull qwen3:32b
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b
```
To enable thinking mode, append /think to your prompt; for fast non-reasoning responses, use /no_think. Open WebUI should handle the `<think>` blocks and render them as collapsible sections by default.
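For scripting against a local instance, the same soft-switch convention works through Ollama's HTTP API. A minimal sketch (the helper names are ours; assumes `ollama serve` is listening on the default port with the model pulled):

```python
import json
from urllib import request

def qwen3_chat_payload(prompt: str, think: bool,
                       model: str = "qwen3:30b-a3b") -> dict:
    """Build a request body for Ollama's /api/chat endpoint, using
    the Qwen3 soft-switch tags to pick the mode per query."""
    tag = " /think" if think else " /no_think"
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt + tag}],
        "stream": False,
    }

def chat(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST to a local Ollama server and return the assistant text."""
    req = request.Request(host + "/api/chat",
                          data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

payload = qwen3_chat_payload("Summarize this config in two lines.", think=False)
# chat(payload)  # needs a running `ollama serve` with the model pulled
```

Remember to strip or collapse the `<think>` block from thinking-mode replies before showing them to users.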
Hugging Face source: Qwen/Qwen3-32B and the full Qwen3 family under the Qwen org. GGUF quants from community maintainers (bartowski, unsloth) typically appear within 24 hours. LM Studio users should see the models indexed quickly; our runner comparison covers tradeoffs for pleb setups. For troubleshooting MoE-specific loading issues (which can be fussy with some quants), our troubleshooting guide is the pleb reference.
For pleb-stack integration, see the self-hosted AI pleb guide. A reasonable Qwen3 deployment replaces two or three previous-generation models: use Qwen3-30B-A3B for general chat and agent work, Qwen3-32B in thinking mode for hard reasoning problems, and one of the smaller dense variants for always-on, low-latency routing tasks. If you’re integrating with Home Assistant or Obsidian, Qwen3-4B is a strong candidate for the always-on classifier role.
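One way to wire that up is a simple role table in whatever glue code drives your stack; the roles and assignments below are this guide's suggestion, not anything Qwen ships:

```python
# Role table for a single-box Qwen3 stack. Tags match what
# `ollama pull` installs; the role names are our convention.
MODEL_FOR_ROLE = {
    "chat": "qwen3:30b-a3b",   # fast MoE default for chat and agent work
    "reasoning": "qwen3:32b",  # dense flagship, run with /think for hard problems
    "classifier": "qwen3:4b",  # always-on, low-latency routing/intent tasks
}

def pick_model(role: str) -> str:
    """Fall back to the fast generalist for unknown roles."""
    return MODEL_FOR_ROLE.get(role, MODEL_FOR_ROLE["chat"])
```

The point of the indirection is that swapping a model for the next release is a one-line change rather than a hunt through your integrations.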
What comes next
Alibaba’s Qwen team has been the most consistent open-weight ship-cycle of any major lab over the past two years. The release cadence—roughly one major generation every 6–9 months, with frequent point releases and specialized variants (Qwen-Coder, Qwen-VL, Qwen-Audio) in between—signals a committed open-weight strategy, not a PR gesture. Qwen3 extends that trajectory with the hybrid thinking architecture and the push into 119-language multilingual support.
Looking forward: Qwen-VL for Qwen3 (vision variant) is almost certainly in the pipeline, as is a Qwen-Coder-3 with tighter code specialization. Alibaba’s public statements around agentic capability suggest further investment in tool-use and long-horizon task performance. And the MoE architecture decision for the flagship 235B-A22B suggests Alibaba is committed to mixture-of-experts as the frontier path, following DeepSeek V3’s lead.
For plebs today, the action item is concrete. Pull Qwen3-30B-A3B for fast general work. Pull Qwen3-32B for heavy reasoning. Put them behind Open WebUI. Run them on your own hardware. Apache 2.0 license means you can build anything you want on top of these weights—commercial products, paid services, Hashcenter-hosted inference APIs for other plebs—without a single call to a legal team.
Sovereignty, shipping, Apache 2.0, MoE efficiency, hybrid thinking. Qwen3 is the release that makes the open-weight future look inevitable.
Benchmark history
Last benchmarked: April 16, 2026
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MMLU | 88.7 % | dcentral_lab ✓ | April 16, 2026 |
| MMLU-Pro | 83 | vendor_blog ✓ | April 28, 2025 |
| AIME-2024 | 85.7 | vendor_blog ✓ | April 28, 2025 |
| MATH | 71.84 | vendor_blog ✓ | April 28, 2025 |
| GPQA | 77.5 | vendor_blog ✓ | April 28, 2025 |
| MMLU | 87.81 | vendor_blog ✓ | April 28, 2025 |
Recommended hardware
A single used RTX 3090 or 4090 covers the sweet spots: Qwen3-32B and Qwen3-30B-A3B both fit a 24GB card at Q4. Only the 235B-A22B is multi-GPU rig or cloud territory.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
