Phi-4
Microsoft · Phi family · Released December 2024
Microsoft Research's December 2024 Phi-4 — a 14B dense MIT-licensed model punching well above its weight on math and reasoning.
Model card
| Developer | Microsoft |
|---|---|
| Family | Phi |
| License | MIT |
| Modality | text |
| Parameters (B) | 14 |
| Context window | 16384 |
| Release date | December 2024 |
| Primary languages | en |
| Hugging Face | microsoft/phi-4 |
| Ollama | ollama pull phi4 |
Phi-4 released: Microsoft’s 14B punches at 70B-class benchmarks
Microsoft just dropped Phi-4 — a 14-billion parameter language model that, per the release, matches or beats much larger open and closed competitors on STEM-focused reasoning benchmarks. The model lands today in research preview on Azure AI Foundry, with the full technical report on arXiv. Weights will follow on Hugging Face under the MIT License.
Phi-4 is the fourth generation of Microsoft’s small-model research line, and the one that most clearly argues the Phi thesis: data quality beats parameter count. Where competitors have been scaling parameters (DeepSeek V3’s 671B last week, Llama 3.3’s 70B earlier this month), Microsoft is scaling training data curation and synthetic data generation, keeping the model small, and claiming frontier-class reasoning performance. Below: what’s in the model, what the benchmarks say at launch, and what a 14B that competes at STEM means for a sovereign pleb stack.
What’s in the weights
Phi-4 is a 14B decoder-only transformer. Architecturally, it’s a close evolution of Phi-3-Medium (14B, released in April 2024) — Microsoft explicitly notes in the technical report that “phi-4 makes minimal changes to the phi-3 architecture.” The gains come from the data and post-training side, not from architectural novelty.
The Phi lineage: Transformer (2017) → Phi-1 (June 2023, 1.3B, code-specialized) → Phi-1.5 → Phi-2 (December 2023, 2.7B) → Phi-3 (April 2024, 3.8B / 7B / 14B) → Phi-4 today. The research line has been led by Sébastien Bubeck and collaborators at Microsoft Research, and the core thesis — that carefully curated + synthetic training data can deliver outsized capabilities per parameter — has held up through four generations.
Key specs:
- 14B parameters, dense decoder-only transformer
- Context window: 16K tokens native (Phi-3-Medium shipped in 4K and 128K variants; Phi-4 splits the difference with a single 16K model)
- Training data: ~9.8T tokens, with heavy emphasis on synthetic data generated by larger teacher models (including GPT-4)
- Training approach: multi-stage curriculum, with synthetic math / reasoning / code data, and DPO-based post-training
- Tokenizer: tiktoken, with a 100,352-token vocabulary
- License: MIT License (on HF release) — fully permissive, commercial OK
The standout detail is the training data strategy. Phi-4 was trained primarily on synthetic data — textbook-quality examples generated by larger models (GPT-4 and a “pivotal teacher model” mentioned in the technical report), filtered aggressively for quality. The technical report is explicit: Phi-4 “substantially surpasses its teacher model on STEM-focused QA capabilities.” That’s an important claim — it’s saying that student models can outperform teachers on specific axes when the curriculum is designed carefully. Whether that generalizes to non-STEM tasks is the open question the community will test.
Benchmarks at release
From Microsoft’s technical report, published today:
- MMLU: Phi-4 at 84.8 vs Llama 3.3 70B at 86.0 and GPT-4o at 88.1 — Phi-4 is within two points of the 70B-class.
- GPQA Diamond: Phi-4 at 56.1 vs Llama 3.3 70B at 50.5 and GPT-4o at 50.6 — Phi-4 is ahead of both on graduate-level STEM reasoning.
- MATH: Phi-4 at 80.4 vs Llama 3.3 70B at 77.0 and GPT-4o at 74.6 — Phi-4 ahead.
- HumanEval (code): Phi-4 at 82.6 vs Llama 3.3 70B at 88.4 and GPT-4o at 90.6 — Phi-4 trails here; code is not its strongest suit.
- MGSM (multilingual grade-school math): Phi-4 at 80.6 vs Llama 3.3 70B’s 87.0 — Phi-4 trails on multilingual.
- AMC-10/12 (high school math competitions): Phi-4 at 91.8 — best-in-class for its size.
- SimpleQA (fact recall): Phi-4 lower than larger models, reflecting the synthetic-data training — Phi-4 is reasoning-strong, knowledge-moderate.
The pattern is consistent. Phi-4 wins on STEM, math, and reasoning. It loses on long-tail factual recall, multilingual coverage, and parts of code. That’s the expected profile for a data-curated 14B: it’s sharp where the curriculum focused, weaker where it didn’t. The Open LLM Leaderboard will slot it in the next few days.
Sovereign pleb implications
Phi-4 at 14B lands in a particularly useful VRAM tier for plebs. The VRAM math:
- Phi-4 at fp16: about 28GB. Clean on a 32GB or 48GB card; a single used 3090 (24GB) works with some layers offloaded to CPU.
- Phi-4 at Q8: about 15GB. Comfortable on a single 3090, 4080, or 4090; a 16GB 4060 Ti is tight once KV cache is counted.
- Phi-4 at Q5_K_M: about 10GB. Sweet spot for a 12GB 3060 or a 12GB 3080.
- Phi-4 at Q4_K_M: about 9GB. Fits a 12GB card with headroom; an 8GB 3060 Ti or 4060 runs it with a few layers offloaded.
See the GGUF quant guide for the quality tradeoffs. On 14B, the Q5–Q6 range is usually the best price-performance point if VRAM allows.
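The sizes above fall out of a simple parameters-times-bits-per-weight calculation. A minimal sketch, assuming approximate average bits-per-weight for llama.cpp's K-quants and the commonly reported 14.7B parameter count (add another 0.5–2GB on top for KV cache and runtime overhead):

```python
# Rough GGUF file-size estimator: parameters x bits-per-weight / 8.
# Bits-per-weight values are approximate averages for llama.cpp quant
# levels; exact file sizes vary slightly per model.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate model file size in GB for a given quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"Phi-4 (14.7B) at {quant}: ~{gguf_size_gb(14.7, quant):.1f} GB")
```

The numbers this prints (roughly 29.4, 15.6, 10.5, and 8.9 GB) line up with the tier list above; the same function works for any dense model once you know its parameter count.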
What this replaces in a pleb stack:
- STEM / math / reasoning workloads: Phi-4 is now the open-weights go-to at sub-20B. If you’ve been using Qwen 2.5 14B or Gemma 2 27B for math, run Phi-4 head-to-head on your specific tasks — it may be a clean upgrade at smaller size.
- Coding: don’t replace Qwen 2.5-Coder 32B with Phi-4. Phi-4 trails on code. Keep your code model; add Phi-4 as a STEM specialist.
- Factual chat / long-tail knowledge: don’t replace Llama 3.3 70B or Qwen 2.5 72B for general chat. Phi-4’s synthetic-heavy training leaves gaps on obscure facts that bigger models fill.
- Single-GPU home rigs: Phi-4 at Q5 on a 12GB card is the new “high-end reasoning at home” option for plebs who can’t step up to dual-GPU. This is a meaningful upgrade over anything else in the 14B class.
For inference-as-heater builds, a single 3090 pushing Phi-4 at sustained load is a 350W heat source running frontier-class STEM work — excellent thermal profile. For small Hashcenter operators, 14B at Q8 lets you pack more concurrent sessions onto a single GPU host than 70B-class models, which matters when you’re serving multiple users per card.
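How many concurrent sessions fit on one card is mostly a KV-cache question. A back-of-envelope sketch, with illustrative hyperparameters (40 layers, 10 KV heads of dim 128, fp16 cache) that are assumptions for the sake of the arithmetic, not confirmed Phi-4 values; check the model's config.json on Hugging Face before planning real capacity:

```python
# Back-of-envelope KV-cache sizing for concurrent sessions on one GPU.
# The default hyperparameters are ASSUMED for illustration -- verify
# them against the model's config.json before relying on the numbers.
def kv_cache_gb(tokens: int, layers: int = 40, kv_heads: int = 10,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: K and V tensors, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

def max_sessions(vram_gb: float, model_gb: float, ctx_tokens: int) -> int:
    """Full-context sessions that fit after the weights are loaded."""
    free = vram_gb - model_gb
    per_session = kv_cache_gb(ctx_tokens)
    return max(0, int(free // per_session))

print(kv_cache_gb(16384))           # ~3.36 GB per full 16K-context session
print(max_sessions(24, 15, 16384))  # 2 sessions on a 24GB card with Q8 weights
```

Under these assumptions a 24GB card holding Q8 weights serves only a couple of full-context sessions; capping context per user, or quantizing the KV cache, is how operators stretch that number.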
How to run it today
Phi-4 is currently in research preview on Azure AI Foundry. Microsoft has announced that weights will hit Hugging Face under the MIT License within the coming weeks — and community mirrors of the preview weights may appear sooner. Once on HF, Phi-4 will land on the Ollama registry in short order:
ollama pull phi4
(That command will work once the registry entry lands — expect within a few days of the HF weights going public.) New to Ollama? The 10-minute Ollama install guide covers setup. For a chat UI, Open WebUI works cleanly.
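Once the model is pulled, you don't need a UI to script against it: Ollama exposes a local REST API on port 11434. A minimal sketch using only the standard library (the `phi4` tag is assumed to exist on the registry by then):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama daemon and return the completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama daemon with the phi4 tag pulled):
# print(ask("phi4", "Prove that sqrt(2) is irrational."))
```

Everything stays on localhost: no API key, no cloud round-trip, which is the whole point of the sovereign setup.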
LM Studio users: watch for Bartowski’s GGUF quants of Phi-4 to appear on Hugging Face within 24 hours of the MIT release. Hitting issues? The self-hosted AI troubleshooting guide covers the usual GPU and loading snags.
What comes next
Microsoft will almost certainly ship a Phi-4-mini variant (smaller, faster, edge-targeted) following the pattern from Phi-3. Expect instruction-tuned variants beyond the default today, and a multimodal Phi-4-Vision is plausible given Microsoft’s prior Phi-3-Vision release. Community fine-tunes will appear on Hugging Face once the MIT-licensed weights are public — the permissive license makes derivative work frictionless.
Bigger picture: Phi-4 is the strongest validation yet of the data-over-parameters thesis. If a 14B can compete with 70B-class on STEM, the question “how big does your model need to be?” has a new, more interesting answer — “it depends on the workload, and probably smaller than you thought.” For sovereign plebs, that’s excellent news: smaller models fit on cheaper hardware, run faster, and heat less — while still doing real work. Pull the weights when they hit HF, test Phi-4 against your math and reasoning workloads, and own your stack. See the Sovereign AI Manifesto for the case, and the pleb’s guide to self-hosted AI for the setup.
Further reading: The same pleb-grade infrastructure that runs local inference also runs a Bitcoin space heater. Many readers arrive from the mining side — see From S19 to Your First AI Hashcenter for the bridge.
Benchmark history
Last benchmarked: December 12, 2024
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MATH | 80.4 | vendor_blog ✓ | December 12, 2024 |
| GPQA | 56.1 | vendor_blog ✓ | December 12, 2024 |
| HumanEval | 82.6 | vendor_blog ✓ | December 12, 2024 |
| MMLU | 84.8 | vendor_blog ✓ | December 12, 2024 |
Recommended hardware
Runs on 12–16 GB of VRAM (4070 Ti or M3 Pro class). A Q4_K_M quant fits comfortably.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
