AI Leaderboards
Benchmarks & Hardware
A running scoreboard for self-hosted AI — which open models are tested against what, and which pieces of silicon make a pleb Hashcenter hum. All data comes from the model creators and silicon vendors themselves.
LLM benchmark coverage
Which models in our catalogue have been tested against each benchmark. Scores are published on release by each model's creator — we don't re-run evals. Hit the model page for the creator's full number.
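The tables below appear to be ordered by score, highest first, with unscored models listed last in alphabetical order. Here is a minimal sketch of that ordering, assuming each row is a (model, score) pair and `None` stands in for a missing score. The `leaderboard_order` helper is hypothetical and is not the site's actual code.

```python
from typing import Optional

# A row is (model name, creator-published score or None when no score is listed).
Row = tuple[str, Optional[float]]


def leaderboard_order(rows: list[Row]) -> list[Row]:
    """Sort scored rows by score descending, then unscored rows, ties by name."""
    return sorted(
        rows,
        key=lambda row: (
            row[1] is None,       # scored rows first, unscored rows last
            -(row[1] or 0.0),     # higher scores first among scored rows
            row[0],               # alphabetical tie-break
        ),
    )


if __name__ == "__main__":
    # Scores copied from the HumanEval table on this page.
    sample = [
        ("Phi-4", 82.6),
        ("Qwen 3", None),
        ("DeepSeek V3", 82.6),
        ("Llama 3.1", 89.0),
        ("DeepSeek R1", None),
    ]
    for name, score in leaderboard_order(sample):
        print(name, "n/a" if score is None else score)
```

Sorting on a tuple key keeps the rule in one place: the boolean pushes unscored models to the bottom, and the name breaks ties such as DeepSeek V3 and Phi-4 at 82.6.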
AIME-2024 (3 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 3 | 85.7 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek R1 | 79.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | 39.2 | DeepSeek | 671 | 128K | DeepSeek License |
Source: scores published on release by each model's creator for AIME-2024.
GPQA (12 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 3 | 77.5 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek R1 | 71.5 | DeepSeek | 671 | 128K | MIT (most distills) |
| Llama 4 (Scout/Maverick) | 69.8 | Llama | — | 10,000K | Llama 4 Community |
| DeepSeek V3 | 59.1 | DeepSeek | 671 | 128K | DeepSeek License |
| Phi-4 | 56.1 | Phi | 14 | 16K | MIT |
| Llama 3.1 | 50.7 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.3 | 50.5 | Llama | 70 | 128K | Llama 3.3 Community |
| Qwen 2.5 | 49.0 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 45.3 | Mistral | 24 | 33K | Apache-2.0 |
| Gemma 3 | 24.3 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
Source: scores published on release by each model's creator for GPQA.
HumanEval (14 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Llama 3.1 | 89.0 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.3 | 88.4 | Llama | 70 | 128K | Llama 3.3 Community |
| Qwen 2.5 | 86.6 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 84.8 | Mistral | 24 | 33K | Apache-2.0 |
| DeepSeek V3 | 82.6 | DeepSeek | 671 | 128K | DeepSeek License |
| Phi-4 | 82.6 | Phi | 14 | 16K | MIT |
| Gemma 2 | 51.8 | Gemma | 27 | 8K | Gemma Terms |
| Gemma 3 | 48.8 | Gemma | 27 | 128K | Gemma Terms |
| Mixtral 8x7B | 40.2 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 30.5 | Mistral | 7 | 33K | Apache-2.0 |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
Source: scores published on release by each model's creator for HumanEval.
MATH (12 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| DeepSeek R1 | 97.3 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | 90.2 | DeepSeek | 671 | 128K | DeepSeek License |
| Qwen 2.5 | 83.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Phi-4 | 80.4 | Phi | 14 | 16K | MIT |
| Llama 3.3 | 77.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.1 | 73.8 | Llama | 405 | 128K | Llama 3.1 Community |
| Qwen 3 | 71.8 | Qwen | 235 | 131K | Apache-2.0 |
| Mistral Small 3 | 70.6 | Mistral | 24 | 33K | Apache-2.0 |
| Gemma 3 | 50.0 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 2 | 42.3 | Gemma | 27 | 8K | Gemma Terms |
| Mixtral 8x7B | 28.4 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 13.1 | Mistral | 7 | 33K | Apache-2.0 |
Source: scores published on release by each model's creator for MATH.
MMLU (14 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| DeepSeek R1 | 90.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| Qwen 3 | 88.7 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek V3 | 88.5 | DeepSeek | 671 | 128K | DeepSeek License |
| Llama 3.1 | 87.3 | Llama | 405 | 128K | Llama 3.1 Community |
| Qwen 2.5 | 86.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Llama 3.3 | 86.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Phi-4 | 84.8 | Phi | 14 | 16K | MIT |
| Gemma 3 | 78.6 | Gemma | 27 | 128K | Gemma Terms |
| Command R+ | 75.7 | Command | 104 | 128K | CC-BY-NC |
| Gemma 2 | 75.2 | Gemma | 27 | 8K | Gemma Terms |
| Mixtral 8x7B | 70.6 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 60.1 | Mistral | 7 | 33K | Apache-2.0 |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
Source: scores published on release by each model's creator for MMLU.
MMLU-Pro (1 model tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Mistral Small 3 | 66.3 | Mistral | 24 | 33K | Apache-2.0 |
Source: scores published on release by each model's creator for MMLU-Pro.
MT-Bench (14 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 2.5 | 9.4 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 8.4 | Mistral | 24 | 33K | Apache-2.0 |
| Mixtral 8x7B | 8.3 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 6.8 | Mistral | 7 | 33K | Apache-2.0 |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | — | DeepSeek | 671 | 128K | DeepSeek License |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Gemma 3 | — | Gemma | 27 | 128K | Gemma Terms |
| Llama 3.1 | — | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.3 | — | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Phi-4 | — | Phi | 14 | 16K | MIT |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
Source: scores published on release by each model's creator for MT-Bench.
Hardware leaderboard
Every card and appliance in the database, stacked on three axes. VRAM is king for 70B-class models; bandwidth rules token throughput; TDP decides what your 120V circuit can tolerate.
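The VRAM and TDP claims come down to simple arithmetic, sketched here as a rough back-of-envelope calculation. It rests on several assumptions: the memory figure covers weights only (no KV cache, activations, or runtime overhead), the circuit is a 15 A breaker at 120 V, continuous load is held to 80% of the breaker rating, and the 350 W card is a hypothetical example. The function names are illustrative.

```python
import math


def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def continuous_watts(volts: float = 120.0,
                     breaker_amps: float = 15.0,
                     derate: float = 0.8) -> float:
    """Sustained power budget for one circuit (assumed 80% continuous-load derate)."""
    return volts * breaker_amps * derate


def cards_per_circuit(card_tdp_watts: float,
                      overhead_watts: float = 0.0,
                      volts: float = 120.0,
                      breaker_amps: float = 15.0,
                      derate: float = 0.8) -> int:
    """How many cards of a given TDP fit on one circuit after host overhead."""
    if card_tdp_watts <= 0:
        raise ValueError("card_tdp_watts must be positive")
    budget = continuous_watts(volts, breaker_amps, derate) - overhead_watts
    return max(0, math.floor(budget / card_tdp_watts))


if __name__ == "__main__":
    print(weights_gb(70, 16))   # 140.0 GB of weights for a 70B model at 16-bit
    print(weights_gb(70, 4))    # 35.0 GB of weights for the same model at 4-bit
    print(cards_per_circuit(350))                      # 4 hypothetical 350 W cards
    print(cards_per_circuit(350, overhead_watts=150))  # 3 once the host draws 150 W
```

Under these assumptions a 70B model needs roughly 140 GB for 16-bit weights or 35 GB at 4-bit, and a single 15 A circuit sustains about 1,440 W. Real deployments need extra VRAM headroom for context, and local electrical codes govern the actual load limit.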
VRAM (GB) — raw capacity
TDP (watts) — 120V circuit impact
FP16 TFLOPS — raw throughput
Bang per buck (FP16 TFLOPS vs street price)
Higher and to the left is better. Bottom-right = premium territory.
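The same comparison can be reduced to a single number, FP16 TFLOPS per dollar. The sketch below ranks cards on that ratio; the `Card` type and the three sample entries are placeholders made up for illustration, not figures from the hardware database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    fp16_tflops: float
    street_price_usd: float


def tflops_per_dollar(card: Card) -> float:
    """Value metric: FP16 throughput bought per dollar of street price."""
    if card.street_price_usd <= 0:
        raise ValueError("street_price_usd must be positive")
    return card.fp16_tflops / card.street_price_usd


def rank_by_value(cards: list[Card]) -> list[Card]:
    """Best bang per buck first."""
    return sorted(cards, key=tflops_per_dollar, reverse=True)


if __name__ == "__main__":
    # Placeholder numbers, chosen only to exercise the ranking.
    cards = [
        Card("card-a", fp16_tflops=100.0, street_price_usd=1000.0),
        Card("card-b", fp16_tflops=50.0, street_price_usd=250.0),
        Card("card-c", fp16_tflops=200.0, street_price_usd=4000.0),
    ]
    for card in rank_by_value(cards):
        print(f"{card.name}: {tflops_per_dollar(card):.3f} TFLOPS per USD")
```

A single ratio hides what the scatter plot shows: a card can win on TFLOPS per dollar and still lack the VRAM to load the model you care about, so treat the ranking as one input rather than a verdict.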
A note on Hashcenters
These numbers are for owner-operated Hashcenters — a rack in your garage, a GPU pair under your desk, a Mac Studio on the shelf. Rented cloud capacity lives by different rules (zero control, rising rates, someone else's kill switch). If you're sizing a heating setup instead of a server farm, start with Heating with Inference.
The Hashcenter — owner-operated, pleb-scale, sovereign workload — is the alternative to the hyperscaler AI datacenter. See the Sovereign AI for Bitcoiners Manifesto for why, From S19 to Your First AI Hashcenter for how, and Used RTX 3090 for LLMs in 2026 for what to buy.
Charts rendered with Chart.js (MIT). Standing on the shoulders of every vendor and model creator who published the underlying numbers.
