Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

AI Leaderboards

Benchmarks & Hardware

Read Rankings as Snapshots

Leaderboards are snapshots of specific tests, versions, prompts, runtimes, and hardware. They should help shortlist options, not replace local testing with the actual data, latency, privacy, and maintenance constraints of the deployment.

Use With Hardware Context

Pair rankings with model license, VRAM requirement, quantization quality, driver support, energy use, heat, and noise before buying hardware or standardizing a workflow.

Source Basis

AI pages should cite model cards, project repositories, release notes, hardware vendor specifications, driver/runtime documentation, and D-Central infrastructure experience. Benchmark claims should preserve model version, quantization, context length, hardware, driver, runtime, and test date.

ASIC miners do not run LLM workloads. D-Central connects AI to its practical infrastructure domain through power, heat, privacy, local compute, maintenance, and hardware operations.

Reviewer

Reviewed by D-Central editorial staff with a Bitcoin infrastructure, privacy, hardware, and operations lens. Sensitive data, private keys, customer records, and production secrets should not be loaded into experimental AI stacks.

Freshness Policy

Model releases, licenses, GPU prices, driver support, inference runtimes, and leaderboard results change quickly. AI pages should preserve the model and hardware version used, identify stale benchmarks, and separate local privacy guidance from performance claims.

A running scoreboard for self-hosted AI — which open models are tested against what, and which pieces of silicon make a pleb Hashcenter hum. All data comes from the model creators and silicon vendors themselves.

LLM benchmark coverage

Which models in our catalogue have been tested against each benchmark. Scores are published on release by each model's creator — we don't re-run evals. Hit the model page for the creator's full number.

AIME-2024  (6 models tagged)

Model Score Family Max Params (B) Context License
Qwen 3 85.7 Qwen 235 131K Apache-2.0
Qwen 3 85.7 Qwen 235 131K Apache-2.0
DeepSeek R1 79.8 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 79.8 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 39.2 DeepSeek 671 128K DeepSeek License
DeepSeek V3 39.2 DeepSeek 671 128K DeepSeek License

Source: scores published on release by each model's creator for AIME-2024.

GPQA  (24 models tagged)

Model Score Family Max Params (B) Context License
Qwen 3 77.5 Qwen 235 131K Apache-2.0
Qwen 3 77.5 Qwen 235 131K Apache-2.0
DeepSeek R1 71.5 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 71.5 DeepSeek 671 128K MIT (most distills)
Llama 4 (Scout/Maverick) 69.8 Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) 69.8 Llama 10,000K Llama 4 Community
DeepSeek V3 59.1 DeepSeek 671 128K DeepSeek License
DeepSeek V3 59.1 DeepSeek 671 128K DeepSeek License
Phi-4 56.1 Phi 14 16K MIT
Phi-4 56.1 Phi 14 16K MIT
Llama 3.1 50.7 Llama 405 128K Llama 3.1 Community
Llama 3.1 50.7 Llama 405 128K Llama 3.1 Community
Llama 3.3 50.5 Llama 70 128K Llama 3.3 Community
Llama 3.3 50.5 Llama 70 128K Llama 3.3 Community
Qwen 2.5 49.0 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 49.0 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 45.3 Mistral 24 33K Apache-2.0
Mistral Small 3 45.3 Mistral 24 33K Apache-2.0
Gemma 3 24.3 Gemma 27 128K Gemma Terms
Gemma 3 24.3 Gemma 27 128K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community

Source: scores published on release by each model's creator for GPQA.

HumanEval  (28 models tagged)

Model Score Family Max Params (B) Context License
Llama 3.1 89.0 Llama 405 128K Llama 3.1 Community
Llama 3.1 89.0 Llama 405 128K Llama 3.1 Community
Llama 3.3 88.4 Llama 70 128K Llama 3.3 Community
Llama 3.3 88.4 Llama 70 128K Llama 3.3 Community
Qwen 2.5 86.6 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 86.6 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 84.8 Mistral 24 33K Apache-2.0
Mistral Small 3 84.8 Mistral 24 33K Apache-2.0
DeepSeek V3 82.6 DeepSeek 671 128K DeepSeek License
DeepSeek V3 82.6 DeepSeek 671 128K DeepSeek License
Phi-4 82.6 Phi 14 16K MIT
Phi-4 82.6 Phi 14 16K MIT
Gemma 2 51.8 Gemma 27 8K Gemma Terms
Gemma 2 51.8 Gemma 27 8K Gemma Terms
Gemma 3 48.8 Gemma 27 128K Gemma Terms
Gemma 3 48.8 Gemma 27 128K Gemma Terms
Mixtral 8x7B 40.2 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 40.2 Mistral 46.7 33K Apache-2.0
Mistral 7B 30.5 Mistral 7 33K Apache-2.0
Mistral 7B 30.5 Mistral 7 33K Apache-2.0
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Qwen 3 Qwen 235 131K Apache-2.0
Qwen 3 Qwen 235 131K Apache-2.0

Source: scores published on release by each model's creator for HumanEval.

MATH  (24 models tagged)

Model Score Family Max Params (B) Context License
DeepSeek R1 97.3 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 97.3 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 90.2 DeepSeek 671 128K DeepSeek License
DeepSeek V3 90.2 DeepSeek 671 128K DeepSeek License
Qwen 2.5 83.1 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 83.1 Qwen 72 128K Apache-2.0 (most sizes)
Phi-4 80.4 Phi 14 16K MIT
Phi-4 80.4 Phi 14 16K MIT
Llama 3.3 77.0 Llama 70 128K Llama 3.3 Community
Llama 3.3 77.0 Llama 70 128K Llama 3.3 Community
Llama 3.1 73.8 Llama 405 128K Llama 3.1 Community
Llama 3.1 73.8 Llama 405 128K Llama 3.1 Community
Qwen 3 71.8 Qwen 235 131K Apache-2.0
Qwen 3 71.8 Qwen 235 131K Apache-2.0
Mistral Small 3 70.6 Mistral 24 33K Apache-2.0
Mistral Small 3 70.6 Mistral 24 33K Apache-2.0
Gemma 3 50.0 Gemma 27 128K Gemma Terms
Gemma 3 50.0 Gemma 27 128K Gemma Terms
Gemma 2 42.3 Gemma 27 8K Gemma Terms
Gemma 2 42.3 Gemma 27 8K Gemma Terms
Mixtral 8x7B 28.4 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 28.4 Mistral 46.7 33K Apache-2.0
Mistral 7B 13.1 Mistral 7 33K Apache-2.0
Mistral 7B 13.1 Mistral 7 33K Apache-2.0

Source: scores published on release by each model's creator for MATH.

MMLU  (28 models tagged)

Model Score Family Max Params (B) Context License
DeepSeek R1 90.8 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 90.8 DeepSeek 671 128K MIT (most distills)
Qwen 3 88.7 Qwen 235 131K Apache-2.0
Qwen 3 88.7 Qwen 235 131K Apache-2.0
DeepSeek V3 88.5 DeepSeek 671 128K DeepSeek License
DeepSeek V3 88.5 DeepSeek 671 128K DeepSeek License
Llama 3.1 87.3 Llama 405 128K Llama 3.1 Community
Llama 3.1 87.3 Llama 405 128K Llama 3.1 Community
Qwen 2.5 86.1 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 86.1 Qwen 72 128K Apache-2.0 (most sizes)
Llama 3.3 86.0 Llama 70 128K Llama 3.3 Community
Llama 3.3 86.0 Llama 70 128K Llama 3.3 Community
Phi-4 84.8 Phi 14 16K MIT
Phi-4 84.8 Phi 14 16K MIT
Gemma 3 78.6 Gemma 27 128K Gemma Terms
Gemma 3 78.6 Gemma 27 128K Gemma Terms
Command R+ 75.7 Command 104 128K CC-BY-NC
Command R+ 75.7 Command 104 128K CC-BY-NC
Gemma 2 75.2 Gemma 27 8K Gemma Terms
Gemma 2 75.2 Gemma 27 8K Gemma Terms
Mixtral 8x7B 70.6 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 70.6 Mistral 46.7 33K Apache-2.0
Mistral 7B 60.1 Mistral 7 33K Apache-2.0
Mistral 7B 60.1 Mistral 7 33K Apache-2.0
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community

Source: scores published on release by each model's creator for MMLU.

MMLU-Pro  (2 models tagged)

Model Score Family Max Params (B) Context License
Mistral Small 3 66.3 Mistral 24 33K Apache-2.0
Mistral Small 3 66.3 Mistral 24 33K Apache-2.0

Source: scores published on release by each model's creator for MMLU-Pro.

MT-Bench  (28 models tagged)

Model Score Family Max Params (B) Context License
Qwen 2.5 9.4 Qwen 72 128K Apache-2.0 (most sizes)
Qwen 2.5 9.4 Qwen 72 128K Apache-2.0 (most sizes)
Mistral Small 3 8.4 Mistral 24 33K Apache-2.0
Mistral Small 3 8.4 Mistral 24 33K Apache-2.0
Mixtral 8x7B 8.3 Mistral 46.7 33K Apache-2.0
Mixtral 8x7B 8.3 Mistral 46.7 33K Apache-2.0
Mistral 7B 6.8 Mistral 7 33K Apache-2.0
Mistral 7B 6.8 Mistral 7 33K Apache-2.0
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek R1 DeepSeek 671 128K MIT (most distills)
DeepSeek V3 DeepSeek 671 128K DeepSeek License
DeepSeek V3 DeepSeek 671 128K DeepSeek License
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 2 Gemma 27 8K Gemma Terms
Gemma 3 Gemma 27 128K Gemma Terms
Gemma 3 Gemma 27 128K Gemma Terms
Llama 3.1 Llama 405 128K Llama 3.1 Community
Llama 3.1 Llama 405 128K Llama 3.1 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.2 Llama 90 128K Llama 3.2 Community
Llama 3.3 Llama 70 128K Llama 3.3 Community
Llama 3.3 Llama 70 128K Llama 3.3 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Llama 4 (Scout/Maverick) Llama 10,000K Llama 4 Community
Phi-4 Phi 14 16K MIT
Phi-4 Phi 14 16K MIT
Qwen 3 Qwen 235 131K Apache-2.0
Qwen 3 Qwen 235 131K Apache-2.0

Source: scores published on release by each model's creator for MT-Bench.

Coding agents

Terminal- and editor-native agents that actually write and run code. For a pleb Hashcenter, the columns that matter are not the leaderboard hype — they're whether you can read the source and whether it runs air-gapped against your own local model. A closed agent that phones a vendor on every keystroke is a kill switch waiting to happen.

Agent Open-source? Runs locally / air-gappable? License Stars SWE-bench Verified
Claude Code No (proprietary) No — needs Anthropic API/login Proprietary Model-dependent (Claude)
Codex CLI Yes No — OpenAI models only Apache-2.0 ~75k Model-dependent (GPT/Codex)
Aider Yes Yes — any local model (Ollama, etc.) Apache-2.0 44k Model-dependent (BYO LLM)
Cline Yes Yes — Ollama / LM Studio Apache-2.0 62.7k Model-dependent (BYO LLM)
Continue Yes Yes — Ollama / local models Apache-2.0 33.5k Model-dependent (BYO LLM)
OpenHands Yes Yes — any OpenAI-compatible/local MIT 40k+ See source (scaffold ~66–77%)

Open-source status, license and star counts read from each project's own GitHub repo. SWE-bench Verified is a score of the underlying model, not the harness — these agents are model-agnostic, so we cite the source rather than pin a number to the tool. Data as of June 2026, verify quarterly.

Agent frameworks

Orchestration toolkits for wiring agents together — the plumbing behind MCP servers and multi-agent crews. The sovereign questions: can you self-host it, point it at a local LLM, run it on your own node, and let it pay for what it consumes over Lightning (L402) instead of a corporate card? Almost nothing does the last one yet — which is exactly why it matters.

Framework Self-host? Local-LLM? Runs on your node? Lightning / L402-capable? License Stars
Hermes Agent Yes Yes Yes No MIT 179k
AutoGPT Yes Partial Yes No MIT (+Polyform platform) 185k
AutoGen Yes Yes Yes No MIT / CC-BY-4.0 58.7k
CrewAI Yes Yes Yes No MIT 52.8k
LangGraph Yes Yes Yes No MIT 33.8k
Lightning Agent Tools Yes Yes Yes Yes See source

License and star counts read from each project's own GitHub repo. "L402-capable" means the toolkit can pay for or gate APIs over the Lightning Network natively — most general frameworks cannot, and bolt-on payment skills are not counted. Lightning Agent Tools (Lightning Labs) is the reference L402 toolkit. Data as of June 2026, verify quarterly.

Hardware leaderboard

Every card and appliance in the database, stacked on three axes. VRAM is king for 70B-class models; bandwidth rules token throughput; TDP decides what your 120V circuit can tolerate.

VRAM (GB) — raw capacity

TDP (watts) — 120V circuit impact

FP16 TFLOPS — raw throughput

Bang per buck (FP16 TFLOPS vs street price)

Higher and to the left is better. Bottom-right = premium territory.

A note on Hashcenters

These numbers are for owner-operated Hashcenters — a rack in your garage, a GPU pair under your desk, a Mac Studio on the shelf. Rented cloud capacity lives by different rules (zero control, rising rates, someone else's kill switch). If you're sizing a heating setup instead of a server farm, start with Heating with Inference.

The Hashcenter — owner-operated, pleb-scale, sovereign workload — is the alternative to the hyperscaler AI datacenter. See the Sovereign AI for Bitcoiners Manifesto for why, From S19 to Your First AI Hashcenter for how, and Used RTX 3090 for LLMs in 2026 for what to buy.

Charts rendered with Chart.js (MIT). Standing on the shoulders of every vendor and model creator who published the underlying numbers.

Editorial review and limitations

Reviewed by D-Central's mining hardware and ASIC repair editorial team for practical accuracy, buyer risk, repair context, and operational assumptions. Verify current hardware price, stock, network difficulty, BTC price, power rate, shipping, tax, firmware, and device condition before buying, hosting, repairing, or retiring mining hardware.