AI Leaderboards
Benchmarks & Hardware
Read Rankings as Snapshots
Leaderboards are snapshots of specific tests, versions, prompts, runtimes, and hardware. They should help shortlist options, not replace local testing with the actual data, latency, privacy, and maintenance constraints of the deployment.
Use With Hardware Context
Pair rankings with model license, VRAM requirement, quantization quality, driver support, energy use, heat, and noise before buying hardware or standardizing a workflow.
Last reviewed May 24, 2026.
Source Basis
AI pages should cite model cards, project repositories, release notes, hardware vendor specifications, driver/runtime documentation, and D-Central infrastructure experience. Benchmark claims should preserve model version, quantization, context length, hardware, driver, runtime, and test date.
ASIC miners do not run LLM workloads. D-Central connects AI to its practical infrastructure domain through power, heat, privacy, local compute, maintenance, and hardware operations.
Reviewer
Reviewed by D-Central editorial staff with a Bitcoin infrastructure, privacy, hardware, and operations lens. Sensitive data, private keys, customer records, and production secrets should not be loaded into experimental AI stacks.
Freshness Policy
Model releases, licenses, GPU prices, driver support, inference runtimes, and leaderboard results change quickly. AI pages should preserve the model and hardware version used, identify stale benchmarks, and separate local privacy guidance from performance claims.
Last reviewed May 24, 2026. D-Central editorial and repair intake, Laval, Quebec.
A running scoreboard for self-hosted AI — which open models are tested against what, and which pieces of silicon make a pleb Hashcenter hum. All data comes from the model creators and silicon vendors themselves.
LLM benchmark coverage
Which models in our catalogue have been tested against each benchmark. Scores are published on release by each model's creator — we don't re-run evals. Hit the model page for the creator's full number.
AIME-2024 (6 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 3 | 85.7 | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | 85.7 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek R1 | 79.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | 79.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | 39.2 | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | 39.2 | DeepSeek | 671 | 128K | DeepSeek License |
Source: scores published on release by each model's creator for AIME-2024.
GPQA (24 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 3 | 77.5 | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | 77.5 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek R1 | 71.5 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | 71.5 | DeepSeek | 671 | 128K | MIT (most distills) |
| Llama 4 (Scout/Maverick) | 69.8 | Llama | — | 10,000K | Llama 4 Community |
| Llama 4 (Scout/Maverick) | 69.8 | Llama | — | 10,000K | Llama 4 Community |
| DeepSeek V3 | 59.1 | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | 59.1 | DeepSeek | 671 | 128K | DeepSeek License |
| Phi-4 | 56.1 | Phi | 14 | 16K | MIT |
| Phi-4 | 56.1 | Phi | 14 | 16K | MIT |
| Llama 3.1 | 50.7 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.1 | 50.7 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.3 | 50.5 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.3 | 50.5 | Llama | 70 | 128K | Llama 3.3 Community |
| Qwen 2.5 | 49.0 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Qwen 2.5 | 49.0 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 45.3 | Mistral | 24 | 33K | Apache-2.0 |
| Mistral Small 3 | 45.3 | Mistral | 24 | 33K | Apache-2.0 |
| Gemma 3 | 24.3 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 3 | 24.3 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
Source: scores published on release by each model's creator for GPQA.
HumanEval (28 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Llama 3.1 | 89.0 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.1 | 89.0 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.3 | 88.4 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.3 | 88.4 | Llama | 70 | 128K | Llama 3.3 Community |
| Qwen 2.5 | 86.6 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Qwen 2.5 | 86.6 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 84.8 | Mistral | 24 | 33K | Apache-2.0 |
| Mistral Small 3 | 84.8 | Mistral | 24 | 33K | Apache-2.0 |
| DeepSeek V3 | 82.6 | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | 82.6 | DeepSeek | 671 | 128K | DeepSeek License |
| Phi-4 | 82.6 | Phi | 14 | 16K | MIT |
| Phi-4 | 82.6 | Phi | 14 | 16K | MIT |
| Gemma 2 | 51.8 | Gemma | 27 | 8K | Gemma Terms |
| Gemma 2 | 51.8 | Gemma | 27 | 8K | Gemma Terms |
| Gemma 3 | 48.8 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 3 | 48.8 | Gemma | 27 | 128K | Gemma Terms |
| Mixtral 8x7B | 40.2 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mixtral 8x7B | 40.2 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 30.5 | Mistral | 7 | 33K | Apache-2.0 |
| Mistral 7B | 30.5 | Mistral | 7 | 33K | Apache-2.0 |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
Source: scores published on release by each model's creator for HumanEval.
MATH (24 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| DeepSeek R1 | 97.3 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | 97.3 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | 90.2 | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | 90.2 | DeepSeek | 671 | 128K | DeepSeek License |
| Qwen 2.5 | 83.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Qwen 2.5 | 83.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Phi-4 | 80.4 | Phi | 14 | 16K | MIT |
| Phi-4 | 80.4 | Phi | 14 | 16K | MIT |
| Llama 3.3 | 77.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.3 | 77.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.1 | 73.8 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.1 | 73.8 | Llama | 405 | 128K | Llama 3.1 Community |
| Qwen 3 | 71.8 | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | 71.8 | Qwen | 235 | 131K | Apache-2.0 |
| Mistral Small 3 | 70.6 | Mistral | 24 | 33K | Apache-2.0 |
| Mistral Small 3 | 70.6 | Mistral | 24 | 33K | Apache-2.0 |
| Gemma 3 | 50.0 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 3 | 50.0 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 2 | 42.3 | Gemma | 27 | 8K | Gemma Terms |
| Gemma 2 | 42.3 | Gemma | 27 | 8K | Gemma Terms |
| Mixtral 8x7B | 28.4 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mixtral 8x7B | 28.4 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 13.1 | Mistral | 7 | 33K | Apache-2.0 |
| Mistral 7B | 13.1 | Mistral | 7 | 33K | Apache-2.0 |
Source: scores published on release by each model's creator for MATH.
MMLU (28 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| DeepSeek R1 | 90.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | 90.8 | DeepSeek | 671 | 128K | MIT (most distills) |
| Qwen 3 | 88.7 | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | 88.7 | Qwen | 235 | 131K | Apache-2.0 |
| DeepSeek V3 | 88.5 | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | 88.5 | DeepSeek | 671 | 128K | DeepSeek License |
| Llama 3.1 | 87.3 | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.1 | 87.3 | Llama | 405 | 128K | Llama 3.1 Community |
| Qwen 2.5 | 86.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Qwen 2.5 | 86.1 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Llama 3.3 | 86.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.3 | 86.0 | Llama | 70 | 128K | Llama 3.3 Community |
| Phi-4 | 84.8 | Phi | 14 | 16K | MIT |
| Phi-4 | 84.8 | Phi | 14 | 16K | MIT |
| Gemma 3 | 78.6 | Gemma | 27 | 128K | Gemma Terms |
| Gemma 3 | 78.6 | Gemma | 27 | 128K | Gemma Terms |
| Command R+ | 75.7 | Command | 104 | 128K | CC-BY-NC |
| Command R+ | 75.7 | Command | 104 | 128K | CC-BY-NC |
| Gemma 2 | 75.2 | Gemma | 27 | 8K | Gemma Terms |
| Gemma 2 | 75.2 | Gemma | 27 | 8K | Gemma Terms |
| Mixtral 8x7B | 70.6 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mixtral 8x7B | 70.6 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 60.1 | Mistral | 7 | 33K | Apache-2.0 |
| Mistral 7B | 60.1 | Mistral | 7 | 33K | Apache-2.0 |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
Source: scores published on release by each model's creator for MMLU.
MMLU-Pro (2 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Mistral Small 3 | 66.3 | Mistral | 24 | 33K | Apache-2.0 |
| Mistral Small 3 | 66.3 | Mistral | 24 | 33K | Apache-2.0 |
Source: scores published on release by each model's creator for MMLU-Pro.
MT-Bench (28 models tagged)
| Model | Score | Family | Max Params (B) | Context | License |
|---|---|---|---|---|---|
| Qwen 2.5 | 9.4 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Qwen 2.5 | 9.4 | Qwen | 72 | 128K | Apache-2.0 (most sizes) |
| Mistral Small 3 | 8.4 | Mistral | 24 | 33K | Apache-2.0 |
| Mistral Small 3 | 8.4 | Mistral | 24 | 33K | Apache-2.0 |
| Mixtral 8x7B | 8.3 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mixtral 8x7B | 8.3 | Mistral | 46.7 | 33K | Apache-2.0 |
| Mistral 7B | 6.8 | Mistral | 7 | 33K | Apache-2.0 |
| Mistral 7B | 6.8 | Mistral | 7 | 33K | Apache-2.0 |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek R1 | — | DeepSeek | 671 | 128K | MIT (most distills) |
| DeepSeek V3 | — | DeepSeek | 671 | 128K | DeepSeek License |
| DeepSeek V3 | — | DeepSeek | 671 | 128K | DeepSeek License |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Gemma 2 | — | Gemma | 27 | 8K | Gemma Terms |
| Gemma 3 | — | Gemma | 27 | 128K | Gemma Terms |
| Gemma 3 | — | Gemma | 27 | 128K | Gemma Terms |
| Llama 3.1 | — | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.1 | — | Llama | 405 | 128K | Llama 3.1 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.2 | — | Llama | 90 | 128K | Llama 3.2 Community |
| Llama 3.3 | — | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 3.3 | — | Llama | 70 | 128K | Llama 3.3 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Llama 4 (Scout/Maverick) | — | Llama | — | 10,000K | Llama 4 Community |
| Phi-4 | — | Phi | 14 | 16K | MIT |
| Phi-4 | — | Phi | 14 | 16K | MIT |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
| Qwen 3 | — | Qwen | 235 | 131K | Apache-2.0 |
Source: scores published on release by each model's creator for MT-Bench.
Coding agents
Terminal- and editor-native agents that actually write and run code. For a pleb Hashcenter, the columns that matter are not the leaderboard hype — they're whether you can read the source and whether it runs air-gapped against your own local model. A closed agent that phones a vendor on every keystroke is a kill switch waiting to happen.
| Agent | Open-source? | Runs locally / air-gappable? | License | Stars | SWE-bench Verified |
|---|---|---|---|---|---|
| Claude Code | No (proprietary) | No — needs Anthropic API/login | Proprietary | — | Model-dependent (Claude) |
| Codex CLI | Yes | No — OpenAI models only | Apache-2.0 | ~75k | Model-dependent (GPT/Codex) |
| Aider | Yes | Yes — any local model (Ollama, etc.) | Apache-2.0 | 44k | Model-dependent (BYO LLM) |
| Cline | Yes | Yes — Ollama / LM Studio | Apache-2.0 | 62.7k | Model-dependent (BYO LLM) |
| Continue | Yes | Yes — Ollama / local models | Apache-2.0 | 33.5k | Model-dependent (BYO LLM) |
| OpenHands | Yes | Yes — any OpenAI-compatible/local | MIT | 40k+ | See source (scaffold ~66–77%) |
Open-source status, license and star counts read from each project's own GitHub repo. SWE-bench Verified is a score of the underlying model, not the harness — these agents are model-agnostic, so we cite the source rather than pin a number to the tool. Data as of June 2026, verify quarterly.
Agent frameworks
Orchestration toolkits for wiring agents together — the plumbing behind MCP servers and multi-agent crews. The sovereign questions: can you self-host it, point it at a local LLM, run it on your own node, and let it pay for what it consumes over Lightning (L402) instead of a corporate card? Almost nothing does the last one yet — which is exactly why it matters.
| Framework | Self-host? | Local-LLM? | Runs on your node? | Lightning / L402-capable? | License | Stars |
|---|---|---|---|---|---|---|
| Hermes Agent | Yes | Yes | Yes | No | MIT | 179k |
| AutoGPT | Yes | Partial | Yes | No | MIT (+Polyform platform) | 185k |
| AutoGen | Yes | Yes | Yes | No | MIT / CC-BY-4.0 | 58.7k |
| CrewAI | Yes | Yes | Yes | No | MIT | 52.8k |
| LangGraph | Yes | Yes | Yes | No | MIT | 33.8k |
| Lightning Agent Tools | Yes | Yes | Yes | Yes | See source | — |
License and star counts read from each project's own GitHub repo. "L402-capable" means the toolkit can pay for or gate APIs over the Lightning Network natively — most general frameworks cannot, and bolt-on payment skills are not counted. Lightning Agent Tools (Lightning Labs) is the reference L402 toolkit. Data as of June 2026, verify quarterly.
Hardware leaderboard
Every card and appliance in the database, stacked on three axes. VRAM is king for 70B-class models; bandwidth rules token throughput; TDP decides what your 120V circuit can tolerate.
VRAM (GB) — raw capacity
TDP (watts) — 120V circuit impact
FP16 TFLOPS — raw throughput
Bang per buck (FP16 TFLOPS vs street price)
Higher and to the left is better. Bottom-right = premium territory.
A note on Hashcenters
These numbers are for owner-operated Hashcenters — a rack in your garage, a GPU pair under your desk, a Mac Studio on the shelf. Rented cloud capacity lives by different rules (zero control, rising rates, someone else's kill switch). If you're sizing a heating setup instead of a server farm, start with Heating with Inference.
The Hashcenter — owner-operated, pleb-scale, sovereign workload — is the alternative to the hyperscaler AI datacenter. See the Sovereign AI for Bitcoiners Manifesto for why, From S19 to Your First AI Hashcenter for how, and Used RTX 3090 for LLMs in 2026 for what to buy.
Charts rendered with Chart.js (MIT). Standing on the shoulders of every vendor and model creator who published the underlying numbers.
