Skip to content

We're upgrading our operations to serve you better. Orders ship as usual from Laval, QC. Questions? Contact us

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

AI Leaderboards

Benchmarks & Hardware

A running scoreboard for self-hosted AI — which open models are tested against what, and which pieces of silicon make a pleb Hashcenter hum. All data comes from the model creators and silicon vendors themselves.

LLM benchmark coverage

Which models in our catalogue have been tested against each benchmark. Scores are published on release by each model's creator — we don't re-run evals. Hit the model page for the creator's full number.

AIME-2024  (6 models tagged)

Model Score Family Max Params (B) Context License
Qwen 3 85.7 Qwen 235 131K Apache-2.0
Qwen 3 85.7 Qwen 235 131K Apache-2.0
DeepSeek R1 79.8 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 79.8 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 39.2 DeepSeek 671 128K DeepSeek License
DeepSeek V3 39.2 DeepSeek 671 128K DeepSeek License

Source: scores published on release by each model's creator for AIME-2024.

GPQA  (24 models tagged)

Model Score Family Max Params (B) Context License
Qwen 3 77.5 Qwen 235 131K Apache-2.0
Qwen 3 77.5 Qwen 235 131K Apache-2.0
DeepSeek R1 71.5 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 71.5 DeepSeek 671 128K MIT (most distills)
Llama 4 (Scout/Maverick) 69.8 Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) 69.8 Llama 10,000K Llama 4 Community
DeepSeek V3 59.1 DeepSeek 671 128K DeepSeek License
DeepSeek V3 59.1 DeepSeek 671 128K DeepSeek License
Phi-4 56.1 Phi 14 16K MIT
Phi-4 56.1 Phi 14 16K MIT
Llama 3.1 50.7 Llama 405 128K Llama 3.1 Community
Llama 3.1 50.7 Llama 405 128K Llama 3.1 Community
Llama 3.3 50.5 Llama 70 128K Llama 3.3 Community
Llama 3.3 50.5 Llama 70 128K Llama 3.3 Community
Qwen 2.5 49.0 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 49.0 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 45.3 Mistral 24 33K Apache-2.0
Mistral Small 3 45.3 Mistral 24 33K Apache-2.0
Gemma 3 24.3 Gemma 27 128K Gemma Terms
Gemma 3 24.3 Gemma 27 128K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community

Source: scores published on release by each model's creator for GPQA.

HumanEval  (28 models tagged)

Model Score Family Max Params (B) Context License
Llama 3.1 89.0 Llama 405 128K Llama 3.1 Community
Llama 3.1 89.0 Llama 405 128K Llama 3.1 Community
Llama 3.3 88.4 Llama 70 128K Llama 3.3 Community
Llama 3.3 88.4 Llama 70 128K Llama 3.3 Community
Qwen 2.5 86.6 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 86.6 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 84.8 Mistral 24 33K Apache-2.0
Mistral Small 3 84.8 Mistral 24 33K Apache-2.0
DeepSeek V3 82.6 DeepSeek 671 128K DeepSeek License
DeepSeek V3 82.6 DeepSeek 671 128K DeepSeek License
Phi-4 82.6 Phi 14 16K MIT
Phi-4 82.6 Phi 14 16K MIT
Gemma 2 51.8 Gemma 27 8K Gemma Terms
Gemma 2 51.8 Gemma 27 8K Gemma Terms
Gemma 3 48.8 Gemma 27 128K Gemma Terms
Gemma 3 48.8 Gemma 27 128K Gemma Terms
Mixtral 8x7B 40.2 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 40.2 Mistral 46.7 33K Apache-2.0
Mistral 7B 30.5 Mistral 7 33K Apache-2.0
Mistral 7B 30.5 Mistral 7 33K Apache-2.0
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Qwen 3 Qwen 235 131K Apache-2.0
Qwen 3 Qwen 235 131K Apache-2.0

Source: scores published on release by each model's creator for HumanEval.

MATH  (24 models tagged)

Model Score Family Max Params (B) Context License
DeepSeek R1 97.3 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 97.3 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 90.2 DeepSeek 671 128K DeepSeek License
DeepSeek V3 90.2 DeepSeek 671 128K DeepSeek License
Qwen 2.5 83.1 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 83.1 Qwen 72 128K Apache-2.0 (most sizes)
Phi-4 80.4 Phi 14 16K MIT
Phi-4 80.4 Phi 14 16K MIT
Llama 3.3 77.0 Llama 70 128K Llama 3.3 Community
Llama 3.3 77.0 Llama 70 128K Llama 3.3 Community
Llama 3.1 73.8 Llama 405 128K Llama 3.1 Community
Llama 3.1 73.8 Llama 405 128K Llama 3.1 Community
Qwen 3 71.8 Qwen 235 131K Apache-2.0
Qwen 3 71.8 Qwen 235 131K Apache-2.0
Mistral Small 3 70.6 Mistral 24 33K Apache-2.0
Mistral Small 3 70.6 Mistral 24 33K Apache-2.0
Gemma 3 50.0 Gemma 27 128K Gemma Terms
Gemma 3 50.0 Gemma 27 128K Gemma Terms
Gemma 2 42.3 Gemma 27 8K Gemma Terms
Gemma 2 42.3 Gemma 27 8K Gemma Terms
Mixtral 8x7B 28.4 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 28.4 Mistral 46.7 33K Apache-2.0
Mistral 7B 13.1 Mistral 7 33K Apache-2.0
Mistral 7B 13.1 Mistral 7 33K Apache-2.0

Source: scores published on release by each model's creator for MATH.

MMLU  (28 models tagged)

Model Score Family Max Params (B) Context License
DeepSeek R1 90.8 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 90.8 DeepSeek 671 128K MIT (most distills)
Qwen 3 88.7 Qwen 235 131K Apache-2.0
Qwen 3 88.7 Qwen 235 131K Apache-2.0
DeepSeek V3 88.5 DeepSeek 671 128K DeepSeek License
DeepSeek V3 88.5 DeepSeek 671 128K DeepSeek License
Llama 3.1 87.3 Llama 405 128K Llama 3.1 Community
Llama 3.1 87.3 Llama 405 128K Llama 3.1 Community
Qwen 2.5 86.1 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 86.1 Qwen 72 128K Apache-2.0 (most sizes)
Llama 3.3 86.0 Llama 70 128K Llama 3.3 Community
Llama 3.3 86.0 Llama 70 128K Llama 3.3 Community
Phi-4 84.8 Phi 14 16K MIT
Phi-4 84.8 Phi 14 16K MIT
Gemma 3 78.6 Gemma 27 128K Gemma Terms
Gemma 3 78.6 Gemma 27 128K Gemma Terms
Command R+ 75.7 Command 104 128K CC-BY-NC
Command R+ 75.7 Command 104 128K CC-BY-NC
Gemma 2 75.2 Gemma 27 8K Gemma Terms
Gemma 2 75.2 Gemma 27 8K Gemma Terms
Mixtral 8x7B 70.6 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 70.6 Mistral 46.7 33K Apache-2.0
Mistral 7B 60.1 Mistral 7 33K Apache-2.0
Mistral 7B 60.1 Mistral 7 33K Apache-2.0
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community

Source: scores published on release by each model's creator for MMLU.

MMLU-Pro  (2 models tagged)

Model Score Family Max Params (B) Context License
Mistral Small 3 66.3 Mistral 24 33K Apache-2.0
Mistral Small 3 66.3 Mistral 24 33K Apache-2.0

Source: scores published on release by each model's creator for MMLU-Pro.

MT-Bench  (28 models tagged)

Model Score Family Max Params (B) Context License
Qwen 2.5 9.4 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 9.4 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 8.4 Mistral 24 33K Apache-2.0
Mistral Small 3 8.4 Mistral 24 33K Apache-2.0
Mixtral 8x7B 8.3 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 8.3 Mistral 46.7 33K Apache-2.0
Mistral 7B 6.8 Mistral 7 33K Apache-2.0
Mistral 7B 6.8 Mistral 7 33K Apache-2.0
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 DeepSeek 671 128K DeepSeek License
DeepSeek V3 DeepSeek 671 128K DeepSeek License
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 3 Gemma 27 128K Gemma Terms
Gemma 3 Gemma 27 128K Gemma Terms
Llama 3.1 Llama 405 128K Llama 3.1 Community
Llama 3.1 Llama 405 128K Llama 3.1 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.3 Llama 70 128K Llama 3.3 Community
Llama 3.3 Llama 70 128K Llama 3.3 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Phi-4 Phi 14 16K MIT
Phi-4 Phi 14 16K MIT
Qwen 3 Qwen 235 131K Apache-2.0
Qwen 3 Qwen 235 131K Apache-2.0

Source: scores published on release by each model's creator for MT-Bench.

Hardware leaderboard

Every card and appliance in the database, stacked on three axes. VRAM is king for 70B-class models; bandwidth rules token throughput; TDP decides what your 120V circuit can tolerate.

VRAM (GB) — raw capacity

TDP (watts) — 120V circuit impact

FP16 TFLOPS — raw throughput

Bang per buck (FP16 TFLOPS vs street price)

Higher and to the left is better. Bottom-right = premium territory.

A note on Hashcenters

These numbers are for owner-operated Hashcenters — a rack in your garage, a GPU pair under your desk, a Mac Studio on the shelf. Rented cloud capacity lives by different rules (zero control, rising rates, someone else's kill switch). If you're sizing a heating setup instead of a server farm, start with Heating with Inference.

The Hashcenter — owner-operated, pleb-scale, sovereign workload — is the alternative to the hyperscaler AI datacenter. See the Sovereign AI for Bitcoiners Manifesto for why, From S19 to Your First AI Hashcenter for how, and Used RTX 3090 for LLMs in 2026 for what to buy.

Charts rendered with Chart.js (MIT). Standing on the shoulders of every vendor and model creator who published the underlying numbers.