AI Leaderboards

Benchmarks & Hardware

Read Rankings as Snapshots

Leaderboards are snapshots of specific tests, versions, prompts, runtimes, and hardware. They should help shortlist options, not replace local testing with the actual data, latency, privacy, and maintenance constraints of the deployment.

Use With Hardware Context

Pair rankings with model license, VRAM requirement, quantization quality, driver support, energy use, heat, and noise before buying hardware or standardizing a workflow.

Last reviewed April 16, 2026.

Source Basis

AI pages should cite model cards, project repositories, release notes, hardware vendor specifications, driver/runtime documentation, and D-Central infrastructure experience. Benchmark claims should preserve model version, quantization, context length, hardware, driver, runtime, and test date.

ASIC miners do not run LLM workloads. D-Central connects AI to its practical infrastructure domain through power, heat, privacy, local compute, maintenance, and hardware operations.

Reviewer

Reviewed by D-Central editorial staff with a Bitcoin infrastructure, privacy, hardware, and operations lens. Sensitive data, private keys, customer records, and production secrets should not be loaded into experimental AI stacks.

Freshness Policy

Model releases, licenses, GPU prices, driver support, inference runtimes, and leaderboard results change quickly. AI pages should preserve the model and hardware version used, identify stale benchmarks, and separate local privacy guidance from performance claims.

Last reviewed April 16, 2026. D-Central editorial and repair intake, Montreal, Quebec.

A running scoreboard for self-hosted AI — which open models are tested against what, and which pieces of silicon make a pleb Hashcenter hum. All data comes from the model creators and silicon vendors themselves.

LLM benchmark coverage

Which models in our catalogue have been tested against each benchmark. Scores are published on release by each model's creator — we don't re-run evals. Hit the model page for the creator's full number.

AIME-2024 (3 models tagged)

Model	Score	Family	Max Params (B)	Context	License
Qwen 3	85.7	Qwen	235	131K	Apache-2.0
DeepSeek R1	79.8	DeepSeek	671	128K	MIT (most distills)
DeepSeek V3	39.2	DeepSeek	671	128K	DeepSeek License

Source: scores published on release by each model's creator for AIME-2024.

GPQA (12 models tagged)

Model	Score	Family	Max Params (B)	Context	License
Qwen 3	77.5	Qwen	235	131K	Apache-2.0
DeepSeek R1	71.5	DeepSeek	671	128K	MIT (most distills)
Llama 4 (Scout/Maverick)	69.8	Llama	—	10,000K	Llama 4 Community
DeepSeek V3	59.1	DeepSeek	671	128K	DeepSeek License
Phi-4	56.1	Phi	14	16K	MIT
Llama 3.1	50.7	Llama	405	128K	Llama 3.1 Community
Llama 3.3	50.5	Llama	70	128K	Llama 3.3 Community
Qwen 2.5	49.0	Qwen	72	128K	Apache-2.0 (most sizes)
Mistral Small 3	45.3	Mistral	24	33K	Apache-2.0
Gemma 3	24.3	Gemma	27	128K	Gemma Terms
Gemma 2	—	Gemma	27	8K	Gemma Terms
Llama 3.2	—	Llama	90	128K	Llama 3.2 Community

Source: scores published on release by each model's creator for GPQA.

HumanEval (14 models tagged)

Model	Score	Family	Max Params (B)	Context	License
Llama 3.1	89.0	Llama	405	128K	Llama 3.1 Community
Llama 3.3	88.4	Llama	70	128K	Llama 3.3 Community
Qwen 2.5	86.6	Qwen	72	128K	Apache-2.0 (most sizes)
Mistral Small 3	84.8	Mistral	24	33K	Apache-2.0
DeepSeek V3	82.6	DeepSeek	671	128K	DeepSeek License
Phi-4	82.6	Phi	14	16K	MIT
Gemma 2	51.8	Gemma	27	8K	Gemma Terms
Gemma 3	48.8	Gemma	27	128K	Gemma Terms
Mixtral 8x7B	40.2	Mistral	46.7	33K	Apache-2.0
Mistral 7B	30.5	Mistral	7	33K	Apache-2.0
DeepSeek R1	—	DeepSeek	671	128K	MIT (most distills)
Llama 3.2	—	Llama	90	128K	Llama 3.2 Community
Llama 4 (Scout/Maverick)	—	Llama	—	10,000K	Llama 4 Community
Qwen 3	—	Qwen	235	131K	Apache-2.0

Source: scores published on release by each model's creator for HumanEval.

MATH (12 models tagged)

Model	Score	Family	Max Params (B)	Context	License
DeepSeek R1	97.3	DeepSeek	671	128K	MIT (most distills)
DeepSeek V3	90.2	DeepSeek	671	128K	DeepSeek License
Qwen 2.5	83.1	Qwen	72	128K	Apache-2.0 (most sizes)
Phi-4	80.4	Phi	14	16K	MIT
Llama 3.3	77.0	Llama	70	128K	Llama 3.3 Community
Llama 3.1	73.8	Llama	405	128K	Llama 3.1 Community
Qwen 3	71.8	Qwen	235	131K	Apache-2.0
Mistral Small 3	70.6	Mistral	24	33K	Apache-2.0
Gemma 3	50.0	Gemma	27	128K	Gemma Terms
Gemma 2	42.3	Gemma	27	8K	Gemma Terms
Mixtral 8x7B	28.4	Mistral	46.7	33K	Apache-2.0
Mistral 7B	13.1	Mistral	7	33K	Apache-2.0

Source: scores published on release by each model's creator for MATH.

MMLU (14 models tagged)

Model	Score	Family	Max Params (B)	Context	License
DeepSeek R1	90.8	DeepSeek	671	128K	MIT (most distills)
Qwen 3	88.7	Qwen	235	131K	Apache-2.0
DeepSeek V3	88.5	DeepSeek	671	128K	DeepSeek License
Llama 3.1	87.3	Llama	405	128K	Llama 3.1 Community
Qwen 2.5	86.1	Qwen	72	128K	Apache-2.0 (most sizes)
Llama 3.3	86.0	Llama	70	128K	Llama 3.3 Community
Phi-4	84.8	Phi	14	16K	MIT
Gemma 3	78.6	Gemma	27	128K	Gemma Terms
Command R+	75.7	Command	104	128K	CC-BY-NC
Gemma 2	75.2	Gemma	27	8K	Gemma Terms
Mixtral 8x7B	70.6	Mistral	46.7	33K	Apache-2.0
Mistral 7B	60.1	Mistral	7	33K	Apache-2.0
Llama 3.2	—	Llama	90	128K	Llama 3.2 Community
Llama 4 (Scout/Maverick)	—	Llama	—	10,000K	Llama 4 Community

Source: scores published on release by each model's creator for MMLU.

MMLU-Pro (1 models tagged)

Model	Score	Family	Max Params (B)	Context	License
Mistral Small 3	66.3	Mistral	24	33K	Apache-2.0

Source: scores published on release by each model's creator for MMLU-Pro.

MT-Bench (14 models tagged)

Model	Score	Family	Max Params (B)	Context	License
Qwen 2.5	9.4	Qwen	72	128K	Apache-2.0 (most sizes)
Mistral Small 3	8.4	Mistral	24	33K	Apache-2.0
Mixtral 8x7B	8.3	Mistral	46.7	33K	Apache-2.0
Mistral 7B	6.8	Mistral	7	33K	Apache-2.0
DeepSeek R1	—	DeepSeek	671	128K	MIT (most distills)
DeepSeek V3	—	DeepSeek	671	128K	DeepSeek License
Gemma 2	—	Gemma	27	8K	Gemma Terms
Gemma 3	—	Gemma	27	128K	Gemma Terms
Llama 3.1	—	Llama	405	128K	Llama 3.1 Community
Llama 3.2	—	Llama	90	128K	Llama 3.2 Community
Llama 3.3	—	Llama	70	128K	Llama 3.3 Community
Llama 4 (Scout/Maverick)	—	Llama	—	10,000K	Llama 4 Community
Phi-4	—	Phi	14	16K	MIT
Qwen 3	—	Qwen	235	131K	Apache-2.0

Source: scores published on release by each model's creator for MT-Bench.

Coding agents: Claude Code, Codex, Cursor & the local alternatives

Terminal- and editor-native agents that actually write and run code. The frontier names — Claude Code, OpenAI Codex, and Cursor — are excellent, but they are proprietary and phone a vendor on every keystroke. For a pleb Hashcenter, the columns that matter are not the leaderboard hype — they're whether you can read the source and whether the agent runs air-gapped against your own local model. That is why we list the open, BYO-LLM alternatives (Aider, Cline, Continue, OpenHands) alongside them: a closed coding agent you cannot self-host is a kill switch waiting to happen.

Agent	Open-source?	Runs locally / air-gappable?	License	Stars	SWE-bench Verified
Claude Code	No (proprietary)	No — needs Anthropic API/login	Proprietary	—	Model-dependent (Claude)
Codex CLI	Yes	No — OpenAI models only	Apache-2.0	~75k	Model-dependent (GPT/Codex)
Cursor	No (proprietary)	No — cloud login, hosted models	Proprietary	—	Model-dependent (frontier)
Aider	Yes	Yes — any local model (Ollama, etc.)	Apache-2.0	44k	Model-dependent (BYO LLM)
Cline	Yes	Yes — Ollama / LM Studio	Apache-2.0	62.7k	Model-dependent (BYO LLM)
Continue	Yes	Yes — Ollama / local models	Apache-2.0	33.5k	Model-dependent (BYO LLM)
OpenHands	Yes	Yes — any OpenAI-compatible/local	MIT	40k+	See source (scaffold ~66–77%)

Open-source status, license and star counts read from each project's own GitHub repo. SWE-bench Verified is a score of the underlying model, not the harness — these agents are model-agnostic, so we cite the source rather than pin a number to the tool. Data as of June 2026, verify quarterly.

Agent frameworks

Orchestration toolkits for wiring agents together — the plumbing behind MCP servers and multi-agent crews. The sovereign questions: can you self-host it, point it at a local LLM, run it on your own node, and let it pay for what it consumes over Lightning (L402) instead of a corporate card? Almost nothing does the last one yet — which is exactly why it matters.

Framework	Self-host?	Local-LLM?	Runs on your node?	Lightning / L402-capable?	License	Stars
Hermes Agent	Yes	Yes	Yes	No	MIT	179k
AutoGPT	Yes	Partial	Yes	No	MIT (+Polyform platform)	185k
AutoGen	Yes	Yes	Yes	No	MIT / CC-BY-4.0	58.7k
CrewAI	Yes	Yes	Yes	No	MIT	52.8k
LangGraph	Yes	Yes	Yes	No	MIT	33.8k
Lightning Agent Tools	Yes	Yes	Yes	Yes	See source	—

License and star counts read from each project's own GitHub repo. "L402-capable" means the toolkit can pay for or gate APIs over the Lightning Network natively — most general frameworks cannot, and bolt-on payment skills are not counted. Lightning Agent Tools (Lightning Labs) is the reference L402 toolkit. Data as of June 2026, verify quarterly.

Hardware leaderboard

Every card and appliance in the database, stacked on three axes. VRAM is king for 70B-class models; bandwidth rules token throughput; TDP decides what your 120V circuit can tolerate.

VRAM (GB) — raw capacity

TDP (watts) — 120V circuit impact

FP16 TFLOPS — raw throughput

Bang per buck (FP16 TFLOPS vs street price)

Higher and to the left is better. Bottom-right = premium territory.

A note on Hashcenters

These numbers are for owner-operated Hashcenters — a rack in your garage, a GPU pair under your desk, a Mac Studio on the shelf. Rented cloud capacity lives by different rules (zero control, rising rates, someone else's kill switch). If you're sizing a heating setup instead of a server farm, start with Heating with Inference.

The Hashcenter — owner-operated, pleb-scale, sovereign workload — is the alternative to the hyperscaler AI datacenter. See the Sovereign AI for Bitcoiners Manifesto for why, From S19 to Your First AI Hashcenter for how, and Used RTX 3090 for LLMs in 2026 for what to buy.

Charts rendered with Chart.js (MIT). Standing on the shoulders of every vendor and model creator who published the underlying numbers.