DeepSeek R1
DeepSeek · DeepSeek family · Released January 2025
DeepSeek's January 2025 reasoning model — frontier chain-of-thought quality, plus six MIT-licensed distills from 1.5B to 70B.
Model card
| Developer | DeepSeek |
|---|---|
| Family | DeepSeek |
| License | MIT (most distills) |
| Modality | text |
| Parameters (B) | 1.5, 7, 8, 14, 32, 70, 671 (MoE) |
| Context window | 128K tokens |
| Release date | January 2025 |
| Primary languages | en, zh |
| Hugging Face | deepseek-ai/DeepSeek-R1 |
| Ollama | ollama pull deepseek-r1 |
DeepSeek just released R1—and the open-weight world now has its first genuine reasoning model. While the AI industry has spent the past four months whispering about OpenAI’s o1 and its chain-of-thought reasoning behind the closed-model curtain, a Chinese lab most people couldn’t pronounce a year ago has shipped something that matches or beats o1 on several public benchmarks, under an MIT license, with the weights on Hugging Face.
This is a bigger deal than the leaderboard numbers suggest. DeepSeek R1 isn’t just a model release—it’s a demonstration that reinforcement learning from reasoning traces, done at scale without human-labeled chain-of-thought data, is a real and reproducible technique. The accompanying paper shows the recipe. For sovereign plebs, that means the reasoning-model moat OpenAI tried to build by keeping o1’s thinking hidden just got filled in with concrete.
What’s in the weights
DeepSeek R1 is a descendant of DeepSeek’s December 2024 V3 model—a 671B parameter mixture-of-experts architecture with 37B activated parameters per token. V3 itself was a quiet shock: a frontier-scale model trained for a reported $5.6M in compute (a figure the community has scrutinized but not disproved), released under open weights in a moment when everyone assumed only Google, OpenAI, Anthropic, and Meta could afford frontier training runs. R1 takes V3 as a base and applies a two-stage training pipeline described in the R1 paper:
- R1-Zero: Pure reinforcement learning on V3-Base with rule-based rewards (is the math answer correct, does the code compile and pass tests). No supervised fine-tuning, no human-curated reasoning traces. The model learns to produce long chain-of-thought "thinking" sections autonomously, emergent from reward signal alone. The DeepSeek paper frames this as evidence that reasoning capability doesn’t require human-labeled thought data.
- R1: A more polished variant with a cold-start supervised fine-tune on a small curated dataset, then RL, then rejection sampling, then a final RL stage. Much better at producing readable, well-formatted reasoning. This is the model plebs actually want to run.
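The rule-based reward is simple enough to sketch. A minimal version for the math case (hypothetical helper, not DeepSeek's published code) just compares the model's final boxed answer against a known ground truth, with no learned reward model and no human grading in the loop:

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the model's final \\boxed{} answer
    exactly matches the reference answer, else 0.0. The signal is
    binary and automatic, which is what lets RL scale without
    human-labeled reasoning traces."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer means zero reward
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

assert math_reward(r"... so the answer is \boxed{42}", "42") == 1.0
assert math_reward("I think it's 42", "42") == 0.0  # unformatted -> no reward
```

The code-reward analog swaps the string comparison for "compile and run the test suite," but the shape is the same: a verifiable yes/no signal per sample.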
Alongside R1, DeepSeek released six distilled models—smaller, dense models trained on reasoning traces generated by R1 itself. These are the pleb-accessible piece of the release:
- DeepSeek-R1-Distill-Qwen-1.5B: Runs on a phone. Seriously.
- DeepSeek-R1-Distill-Qwen-7B: Single-GPU daily driver.
- DeepSeek-R1-Distill-Llama-8B: Llama 3.1 8B with reasoning baked in.
- DeepSeek-R1-Distill-Qwen-14B: The sweet spot for 24GB cards.
- DeepSeek-R1-Distill-Qwen-32B: The pleb flagship. Beats o1-mini on math and code per DeepSeek’s release.
- DeepSeek-R1-Distill-Llama-70B: Based on Llama 3.3 70B, for dual-3090 rigs.
Architecturally, R1 proper is V3’s MoE—671B total, 37B active, 128K context, multi-head latent attention (MLA), and an auxiliary-loss-free load balancing scheme that DeepSeek detailed in V3’s paper. The distills are conventional dense Transformers from the Qwen 2.5 and Llama 3 families. MIT license on all of them. No corporate weasel clauses.
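The "37B active out of 671B" figure comes from MoE routing: per token, a router scores the experts and only the top few do any work. A toy illustration (this is generic top-k gating, not DeepSeek's auxiliary-loss-free scheme; expert counts follow the V3 paper):

```python
def top_k_route(scores: list[float], k: int = 8) -> list[int]:
    """Toy MoE gating: pick the indices of the k highest-scoring
    experts for one token. Only those experts' weights touch the
    token, which is why active params << total params."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# V3/R1-shaped layer: 256 routed experts, 8 active per token.
chosen = top_k_route([0.0, 5.0, 3.0, 9.0], k=2)  # -> experts 1 and 3
print(f"active fraction of total params: {37 / 671:.1%}")  # ~5.5%
```

That ~5.5% active fraction is the whole trick: frontier-scale capacity at a per-token compute cost closer to a 37B dense model.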
Benchmarks at release
Per DeepSeek’s release post and paper, R1 on the public benchmarks most people quote:
- AIME 2024 (competition math): R1 scores 79.8%, matching OpenAI o1-1217’s 79.2%. o1-mini lands around 63.6%.
- MATH-500: R1 at 97.3%, edging o1-1217’s 96.4%.
- Codeforces Elo: R1 at 2029, competitive with o1’s 2061.
- MMLU: R1 at 90.8, o1 at 91.8—essentially tied.
- GPQA Diamond (graduate-level science): R1 at 71.5, o1 at 75.7.
The distilled models are the surprise. DeepSeek-R1-Distill-Qwen-32B scores 72.6 on AIME 2024, beating o1-mini’s 63.6. A 32B parameter model you can run on a single A6000 outperforming OpenAI’s smaller reasoning product. This is not a subtle moment.
Caveats worth stating at release: these are DeepSeek’s self-reported numbers. The LMSys Chatbot Arena hasn’t accumulated enough votes for a reliable ranking yet, and community reproductions on the Open LLM Leaderboard will take weeks. Reasoning-heavy benchmarks are also notoriously sensitive to prompt formatting, so expect some variance from independent evaluators.
What it means for the sovereign pleb
Until today, if you wanted reasoning-model capability—the o1-style "think hard before answering, show your work, catch your own mistakes" behavior—you paid OpenAI per token and sent your queries through their infrastructure. R1-Distill-Qwen-32B changes that. You can run a reasoning model locally. You can pipe research queries through it. You can put it behind Open WebUI and have a private reasoning assistant that never sees a corporate API.
VRAM requirements for the distilled series at Q4_K_M:
- 1.5B: ~1GB — runs on a Raspberry Pi 5 with 8GB RAM at bearable speeds
- 7B / 8B: ~5GB — RTX 3060 12GB, Mac M-series with 16GB, any modest GPU
- 14B: ~9GB — fits on a single RTX 3060 12GB, RTX 4070, or Mac with 24GB+ unified memory
- 32B: ~20GB — single RTX 3090/4090, A5000, or an M-series Mac with 32GB+ unified memory. This is the pleb flagship.
- 70B: ~40GB — dual 3090/4090 or A6000. Same VRAM budget as Llama 3.1 70B.
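Those figures follow from a back-of-the-envelope rule: Q4_K_M averages roughly 4.8 bits per weight (K-quants mix 4-bit and 6-bit blocks), plus some flat allowance for runtime buffers. A rough sketch, where the overhead constant is an assumption and KV cache for long contexts is extra on top:

```python
def q4_vram_gb(params_b: float, overhead_gb: float = 1.5) -> float:
    """Rough Q4_K_M footprint in GB: ~4.8 bits/weight on average,
    plus a flat allowance for runtime buffers. KV cache for long
    chain-of-thought contexts is not included."""
    return params_b * 4.8 / 8 + overhead_gb

for p in (1.5, 7, 14, 32, 70):
    print(f"{p:>4}B ~ {q4_vram_gb(p):.1f} GB")
```

The outputs land in the same ballpark as the list above; exact sizes vary by quant maintainer and architecture.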
For our recommended used RTX 3090 pleb rig, the 32B distill is the single best model released in the past year for the "one card, one model" config. It leaves 4GB of headroom on a 24GB card for a generous KV cache, which you need if you’re doing long chain-of-thought reasoning. Quant selection follows the usual logic we cover in the GGUF quant explainer—Q4_K_M for VRAM-constrained plebs, Q8 if you have room and want near-FP16 quality.
One thing to know about running reasoning models locally: they’re slow. Not because of the architecture, but because the model generates 1,000–10,000 tokens of "thinking" before delivering its final answer. On a 3090 at Q4_K_M, a single R1-Distill-32B response might take 30 seconds to two minutes. This is normal. It’s also why these models are perfect for batch work—research queries, code review, math problems—rather than interactive chat. Queue your questions, run overnight, wake up to answers.
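A minimal overnight batch runner against Ollama's local HTTP API (`/api/generate` on port 11434) could look like the sketch below; the question list, model tag, and output filename are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False returns one complete JSON object instead of a token stream
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(model: str, prompt: str) -> str:
    """Blocking call to the local Ollama server; returns the full response
    (including the <think> section for R1-style models)."""
    req = urllib.request.Request(
        OLLAMA_URL, data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def overnight_batch(questions: list[str], model: str = "deepseek-r1:32b",
                    path: str = "answers.md") -> None:
    """Queue questions before bed, read the markdown file in the morning."""
    with open(path, "w") as out:
        for q in questions:
            out.write(f"## {q}\n\n{ask(model, q)}\n\n")
```

Nothing talks to a corporate API, and the multi-minute per-question latency stops mattering when nobody is waiting on the answer.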
For the self-hosted AI pleb stack, the new three-model default is R1-Distill-32B for hard reasoning problems, Llama 3.1 70B for general daily-driver chat, and a distilled 8B for fast quick-hit responses, all behind an Open WebUI frontend.
If you’re building a Hashcenter converted from retired ASIC hardware—see the S19 conversion playbook—reasoning models are compute-intensive and heat-generating in a way that aligns well with the economic argument. A reasoning batch job that produces valuable output while heating your home is the cleanest version of the inference-heat thesis, and the power math of heating with inference works in its favor.
How to run it today
Quickstart via Ollama:
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:8b
ollama run deepseek-r1:32b
Ollama’s default tags pull Q4_K_M. The reasoning traces are wrapped in <think>...</think> tags in the output—most UIs including Open WebUI collapse these into expandable sections automatically.
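If your frontend doesn't handle the tags, splitting the reasoning trace from the final answer yourself is a few lines. A small sketch (assumes well-formed tags; some quantized builds occasionally emit output without them, which the fallback handles):

```python
import re

THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Separate an R1-style response into (thinking, answer).
    Everything inside <think>...</think> is chain-of-thought;
    everything after the closing tag is the final answer."""
    m = THINK.search(raw)
    if not m:
        return "", raw.strip()  # no tags: treat the whole output as the answer
    return m.group(1).strip(), raw[m.end():].strip()

thinking, answer = split_reasoning(
    "<think>2 + 2... carry nothing...</think>The answer is 4.")
```

Useful for logging the trace separately, or discarding it before piping the answer into downstream tooling.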
Hugging Face source: deepseek-ai/DeepSeek-R1 for the full MoE, and deepseek-ai/DeepSeek-R1-Distill-Qwen-32B for the pleb flagship. GGUF quants from community maintainers (bartowski, unsloth) typically appear within 24 hours of release. LM Studio users: check the in-app search today; see our runner comparison if you’re choosing between frontends. For debugging slow or failed loads, the self-hosted AI troubleshooting guide is the pleb reference.
What comes next
DeepSeek has shown their hand on two fronts. First, the R1 paper is an implicit recipe for anyone else who wants to train a reasoning model—expect Qwen, Mistral, and the Llama team to respond within months with their own reasoning-tuned releases. Second, the economics of the V3 training run, if they hold up to scrutiny, suggest frontier-scale training is vastly cheaper than the industry’s prevailing $100M+ estimates. That has implications for capital allocation across the whole AI stack, which the Hashcenter pivot thesis has been tracking.
For plebs, today’s message is clear: reasoning models are no longer a closed-lab capability. Pull the 32B distill, put it behind Open WebUI, and you have a private o1-competitive assistant running in your closet. The sovereign-AI thesis just took another layer of proprietary tech and made it local.
Run your own reasoning. The frontier labs are not your friend.
Further reading: For the philosophical case behind running this model locally rather than renting it from a frontier lab, read the Sovereign AI for Bitcoiners Manifesto.
Benchmark history
Last benchmarked: January 20, 2025
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| AIME-2024 | 79.8 | vendor_blog ✓ | January 20, 2025 |
| MATH | 97.3 | vendor_blog ✓ | January 20, 2025 |
| GPQA | 71.5 | vendor_blog ✓ | January 20, 2025 |
| MMLU | 90.8 | vendor_blog ✓ | January 20, 2025 |
Recommended hardware
Multi-GPU rig or cloud territory for the full 671B MoE. For most plebs, the 32B or 70B distills are plenty.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
