
DeepSeek V3

DeepSeek · DeepSeek family · Released December 2024

DeepSeek's December 2024 frontier-scale MoE — 671B total, 37B active, trained for ~$5.6M in compute.

Model card

Developer: DeepSeek
Family: DeepSeek
License: DeepSeek License
Modality: text
Parameters (B): 671 (MoE)
Context window: 128,000 tokens
Release date: December 2024
Primary languages: en, zh
Hugging Face: deepseek-ai/DeepSeek-V3
Ollama: ollama pull deepseek-v3

DeepSeek V3 drops: 671B MoE open weights, trained for under $6M

DeepSeek just released DeepSeek V3 — a 671-billion-parameter Mixture-of-Experts model with 37B active per token, with full weights on Hugging Face under the DeepSeek License (commercial use permitted). The model card and technical report are live as of today. Two numbers from the release are going to define the conversation: DeepSeek V3 claims performance competitive with GPT-4o and Claude 3.5 Sonnet on most benchmarks, and the training run reportedly cost $5.58 million in GPU time.

That second number is the one that will keep AI-lab CFOs awake tonight. Labs have been spending nine-figure sums on frontier training runs; DeepSeek says it matched the closed frontier for under $6M. Whether or not that number holds up under scrutiny, the shape of the claim — open weights, MoE efficiency, public paper — is already reshaping how plebs should think about local inference economics. Below: what's in the weights, what the benchmarks say, and what it means for a sovereign AI stack going into 2025.

What’s in the weights

DeepSeek V3 is a sparse Mixture-of-Experts decoder-only transformer. The lineage: Transformer (2017) → Google’s GShard (2020) and Switch Transformer (2021) → Mixtral 8x7B (December 2023, the first mainstream open MoE) → DeepSeek V2 (May 2024) → DeepSeek V3 today. DeepSeek has been iterating on MoE architectures since their V2 release, and V3 is the scaled, polished version of that research line.

Key specs from the technical report:

  • 671B total parameters, 37B active per token (256 experts, top-8 routing per layer)
  • Multi-head Latent Attention (MLA): DeepSeek’s signature attention variant, compresses KV cache to dramatically reduce memory at long context
  • DeepSeekMoE architecture: shared experts + routed experts, fine-grained expert specialization
  • Auxiliary-loss-free load balancing: a new training trick that stabilizes expert routing without the auxiliary balancing loss that’s standard in the MoE literature
  • Multi-Token Prediction (MTP): trains the model to predict an extra future token at each position; the MTP modules are dropped at standard inference (or repurposed for speculative decoding) but improve training efficiency
  • Context window: 128K tokens
  • Training data: 14.8T tokens, heavily Chinese + English, with code and math specialty corpora
  • Training compute: 2.788M H800 GPU-hours, claimed at $5.58M total cost
  • License: DeepSeek License Agreement — commercial use permitted
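The shared-plus-routed expert split is easy to sketch. Below is a toy, pure-Python illustration of top-k gating (tiny dimensions, random weights, no training): a sketch of the routing idea, not DeepSeek's implementation. V3 itself uses 256 routed experts with top-8 routing per layer.

```python
import math
import random

random.seed(0)

D, N_EXPERTS, TOP_K = 8, 16, 2   # toy sizes for illustration

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

experts = [rand_matrix(D, D) for _ in range(N_EXPERTS)]  # stand-ins for expert FFNs
shared_expert = rand_matrix(D, D)                        # always-on shared expert
router = rand_matrix(N_EXPERTS, D)                       # produces one score per expert

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def moe_layer(x):
    # 1) score every expert, 2) keep the top-k, 3) softmax-normalize their gates
    scores = matvec(router, x)
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    exps = [math.exp(scores[i]) for i in top]
    gates = [e / sum(exps) for e in exps]
    # Output = shared expert (always active) + gated sum of the selected routed experts
    out = matvec(shared_expert, x)
    for gate, i in zip(gates, top):
        out = [o + gate * y for o, y in zip(out, matvec(experts[i], x))]
    return out, top

out, chosen = moe_layer([1.0] * D)
```

Only the selected experts' FFNs actually run per token, which is exactly where the 671B-total / 37B-active gap comes from: the compute scales with the active experts, but every expert's weights still have to be resident in memory.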

The architectural details matter. MLA cuts the KV cache to roughly 1/14th the size of standard MHA at the same context length — a significant inference-cost improvement for anyone running long-context workloads. The auxiliary-loss-free balancing and multi-token prediction are both training-side innovations; they don’t change how the model behaves at inference, but they’re part of why DeepSeek could train at the claimed low cost.
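A back-of-envelope calculator makes the KV-cache point concrete. The layer count, head counts, and latent size below are illustrative assumptions for a V3-scale model, not figures from the technical report; the exact saving depends on the real config and on whether the baseline is MHA or GQA.

```python
# KV-cache arithmetic: bytes = entries-per-token * layers * context * element size
def mha_kv_bytes(layers, heads, head_dim, ctx, bytes_per_elem=2):
    # Standard MHA caches a full K and V vector per head, per layer, per token
    return 2 * layers * heads * head_dim * ctx * bytes_per_elem

def mla_kv_bytes(layers, latent_dim, ctx, bytes_per_elem=2):
    # MLA caches one compressed latent per layer, per token, instead of per-head K/V
    return layers * latent_dim * ctx * bytes_per_elem

GB = 1024**3
LAYERS, CTX = 60, 128_000  # assumed layer count, full context window
mha = mha_kv_bytes(LAYERS, heads=128, head_dim=128, ctx=CTX)
mla = mla_kv_bytes(LAYERS, latent_dim=576, ctx=CTX)
print(f"MHA cache: {mha / GB:.0f} GiB, MLA cache: {mla / GB:.1f} GiB")
```

With these assumed dims, a full-context MHA cache alone would dwarf most rigs, while the compressed latent cache fits in a single GPU's memory. That is the whole point of caching one small latent per token instead of per-head K/V.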

Benchmarks at release

From DeepSeek’s technical report published today:

  • MMLU: DeepSeek V3 at 88.5 vs GPT-4o at 87.2 and Claude 3.5 Sonnet at 88.3 — DeepSeek at parity or slightly ahead.
  • MMLU-Pro: DeepSeek V3 at 75.9 vs GPT-4o at 73.3 and Claude 3.5 Sonnet at 78.0 — Claude ahead, DeepSeek ahead of GPT-4o.
  • GPQA Diamond: DeepSeek V3 at 59.1 vs GPT-4o at 49.9 and Claude 3.5 Sonnet at 65.0.
  • MATH-500: DeepSeek V3 at 90.2 vs GPT-4o at 74.6 and Claude 3.5 Sonnet at 78.3 — DeepSeek ahead by a large margin.
  • HumanEval: DeepSeek V3 at 82.6 vs Claude 3.5 Sonnet at 81.7 — DeepSeek slightly ahead.
  • LiveCodeBench: DeepSeek V3 at 40.5 vs GPT-4o at 36.1 — DeepSeek ahead.
  • AIME 2024 (math olympiad): DeepSeek V3 at 39.2, competitive with closed frontier models.
  • Chinese benchmarks (CMMLU, C-Eval): DeepSeek V3 decisively leads all open and closed competitors tested.

Compared to open peers, DeepSeek V3 leads Llama 3.3 70B and Qwen 2.5 72B on nearly every benchmark in the technical report. That makes it the first open-weights model with a credible claim to parity with GPT-4o and Claude 3.5 Sonnet. LMSYS Chatbot Arena will sort out the real preference ranking over the coming weeks.

Sovereign pleb implications

The honest answer: DeepSeek V3 is not a home-rig model for most plebs. MoE sparsity cuts compute, not memory — all 671B parameters must sit in fast memory even though only 37B are active per token. The numbers:

  • fp16: about 1.3TB — serious multi-node territory.
  • Q8: about 700GB — just over a full 8x H100 80GB node (640GB VRAM); you need a second node or CPU offload.
  • Q4_K_M (GGUF): about 400GB — runnable on a big multi-GPU workstation or with heavy CPU offload.
  • Q3 or Q2 (aggressive quant): 250–300GB — possible on a 4-way 3090 setup with offload, but quality degrades and speeds drop.
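Those footprints are straightforward bits-per-weight arithmetic. The effective bits/weight below are ballpark figures for llama.cpp-style quants (Q8_0 and the K-quants carry scale/metadata overhead above their nominal bit width), and this counts weights only — no KV cache, no activations:

```python
PARAMS = 671e9  # total parameters; all must be resident even though only 37B are active

# Approximate effective bits per weight (nominal width + quantization metadata)
QUANTS = {"fp16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}

for name, bits in QUANTS.items():
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name:8s} ~{gb:,.0f} GB")
```

The outputs line up with the list above: ~1.3TB at fp16, ~700GB at Q8, ~400GB at Q4_K_M, and the Q3/Q2 range lands in the few-hundred-GB territory where a multi-3090 rig plus offload becomes thinkable.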

See the GGUF quant guide for the quality/size tradeoffs — at the 600B+ MoE scale, aggressive quantization hits differently than on a dense 70B because expert routing is sensitive to weight precision.

Practically, the plebs who will run DeepSeek V3 locally are the ones who’ve built a small inference rack — six to eight used 3090s, or a workstation with one or two H100s, or a CPU+RAM monster (512GB system RAM + GPU offload runs V3 at Q4 slowly). That’s a Hashcenter-scale setup, not a desktop. For everyone else, the practical local stack stays at Llama 3.3 70B or Qwen 2.5 72B size — both of which fit on a dual-3090 rig.

What DeepSeek V3 changes for plebs who don’t run it locally: the economics of hosted open-weights inference. Multiple providers (together.ai, fireworks, and DeepSeek’s own API) will offer V3 at prices well below GPT-4o or Claude’s per-token rates — because the underlying weights are open and competition will drive prices down. For plebs who want frontier quality without a Hashcenter, hosted V3 becomes the cheapest-per-token frontier option. And because the weights are public, you can audit what you’re using and move providers freely.

For plebs converting decommissioned mining sites to AI inference, V3 is the model where the Hashcenter pivot story starts looking serious. 671B MoE on a small GPU rack at Q4 is exactly the workload that makes sense for operators with cheap power and underutilized hardware.

How to run it today

DeepSeek V3 weights are on Hugging Face at deepseek-ai/DeepSeek-V3. DeepSeek’s inference reference is in their GitHub repo, with vLLM and SGLang integration paths.

For local inference, watch for llama.cpp to merge MLA support over the coming days — until then, GGUF quants exist but require patched builds. An Ollama registry entry for V3 will likely appear in 1–2 weeks once the MoE + MLA plumbing settles; our 10-minute Ollama install guide will get you ready. LM Studio support will follow llama.cpp’s timeline.

For plebs without the hardware, DeepSeek offers a free-tier API at chat.deepseek.com — and openrouter.ai lists V3 as of today through several providers. If you hit issues running quants locally, the self-hosted AI troubleshooting guide covers the usual suspects.
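For the hosted route, DeepSeek's API speaks the OpenAI chat-completions dialect. Here is a minimal stdlib-only sketch; the endpoint URL and `deepseek-chat` model id match DeepSeek's platform docs at the time of writing, but verify against their docs before relying on it:

```python
import json
import os
import urllib.request

API_URL = "https://api.deepseek.com/chat/completions"  # OpenAI-compatible endpoint

def build_request(prompt, model="deepseek-chat"):
    # "deepseek-chat" routes to the current V3 chat model on DeepSeek's platform
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False}

def ask(prompt, api_key):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__" and os.environ.get("DEEPSEEK_API_KEY"):
    print(ask("One sentence: why do open weights matter?",
              os.environ["DEEPSEEK_API_KEY"]))
```

Because the same weights sit behind openrouter.ai and other providers, switching providers is a one-line URL change; that portability is exactly what the open license buys you.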

What comes next

DeepSeek typically follows a V-release with an R-release (reasoning-tuned variant) within weeks. Expect DeepSeek R1 (or similar naming) in Q1 2025 as the RLHF-reasoning counterpart, trained on V3’s base. Community fine-tunes and specialist variants (coder, math, multilingual) will start appearing on Hugging Face — though the sheer size of V3 limits who can fine-tune it.

Bigger picture: the $5.58M training cost claim — if it holds — is the story that will matter most in 2025. It means frontier training is cheap enough that labs outside the US big three can compete. For sovereign plebs, that’s healthy: more diversity in the open-weights landscape, more pressure on closed APIs to keep prices low, and more architectural innovation in the open. The Bitcoin analogy is hard to miss — decentralization works when the network has many capable participants, not three gatekeepers. Pull the weights if you have the rig, watch the hosted prices if you don’t, and own what you can. See the Sovereign AI Manifesto for the case, and the pleb’s guide to self-hosted AI for the setup that fits whatever hardware you already have.

Benchmark history

Last benchmarked: December 26, 2024

Benchmark    Score  Source       Measured
AIME-2024    39.2   vendor_blog  ✓ December 26, 2024
MATH         90.2   vendor_blog  ✓ December 26, 2024
GPQA         59.1   vendor_blog  ✓ December 26, 2024
HumanEval    82.6   vendor_blog  ✓ December 26, 2024
MMLU         88.5   vendor_blog  ✓ December 26, 2024

Recommended hardware

Multi-GPU rig or cloud territory. For most plebs, a 70B-class open model (Llama 3.3 70B, Qwen 2.5 72B) is plenty.

Buying guide: used RTX 3090 for LLMs (2026) →

Get it running

  1. Install Ollama →

    Ten-minute local LLM runtime. One binary, zero cloud.

  2. Give it a web UI →

    Open-WebUI turns Ollama into a self-hosted ChatGPT.

  3. Understand quantization →

    GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.

Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.