GPQA (Graduate-Level Google-Proof Q&A)

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark introduced in 2023 by David Rein and colleagues to test deep scientific reasoning in biology, physics, and chemistry. Its questions are written by domain experts holding or pursuing PhDs and are deliberately constructed to be "Google-proof": skilled non-experts given unrestricted internet access and more than thirty minutes per question reach only about 34% accuracy, while genuine domain experts reach roughly 65%. That design targets the gap that matters most when evaluating a model — the difference between retrieving facts and actually reasoning through a hard technical problem that lookup cannot solve.

The Diamond subset

The most widely cited slice is GPQA Diamond, a 198-question subset where both expert annotators agreed on the correct answer but most non-experts answered incorrectly — the cleanest, hardest core of the dataset. Each question is multiple-choice with four options, giving a 25% random-guess baseline that makes scores easy to interpret: a model at 30% is guessing with a slight edge, while one well above the ~65–70% expert band is doing something genuinely beyond typical human expert performance. Diamond became the standard headline figure for frontier reasoning models precisely because its difficulty resisted the saturation that overtook earlier knowledge tests, and because expert authorship makes shallow retrieval strategies ineffective.
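The 25% guessing floor can be folded into a score directly with a standard chance correction. The sketch below is illustrative only: the function name `chance_corrected` is an assumption for this example and is not part of GPQA or any evaluation library, and it assumes four equally likely options.

```python
def chance_corrected(accuracy: float, n_options: int = 4) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    chance = 1.0 / n_options  # 0.25 for four-option multiple choice
    return (accuracy - chance) / (1.0 - chance)


# Raw scores discussed above: pure guessing, a slight edge, and the expert level.
for acc in (0.25, 0.30, 0.65):
    print(f"{acc:.0%} raw -> {chance_corrected(acc):.3f} of the way from chance to perfect")
```

Read this way, a 30% score covers only a small fraction of the distance between guessing and a perfect result, which is why it counts as a slight edge rather than evidence of reasoning.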

Why it matters — and its limits

GPQA measures whether a model can sustain expert-level scientific reasoning rather than recite memorized trivia, which makes it a sharper discriminator among top-tier models than broad tests where scores cluster. Its caveats are real, though: the question count is small, so a few points of difference can be noise; coverage narrows to three natural sciences, saying little about code, mathematics, or judgment; and as with every public benchmark, the risk of questions leaking into training data grows over time, quietly inflating later scores. A GPQA number is most informative read as a trend across model versions and alongside other evaluations, never as a single verdict.
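To put a rough number on how much noise a small question count introduces, the sketch below uses the normal approximation to the binomial distribution. It rests on assumptions not in the benchmark itself: each question is treated as an independent trial, and two models' runs are treated as independent even though they answer the same questions. The function names are hypothetical.

```python
import math

DIAMOND_QUESTIONS = 198  # size of the Diamond subset


def accuracy_margin(accuracy: float, n: int = DIAMOND_QUESTIONS, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n questions."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_is_distinguishable(acc_a: float, acc_b: float,
                           n: int = DIAMOND_QUESTIONS, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference.

    Simplification: treats the two runs as independent samples of n questions.
    """
    se_diff = math.sqrt(acc_a * (1.0 - acc_a) / n + acc_b * (1.0 - acc_b) / n)
    return abs(acc_a - acc_b) > z * se_diff


print(f"margin at 65%: +/- {accuracy_margin(0.65) * 100:.1f} points")
print("60% vs 65% distinguishable:", gap_is_distinguishable(0.60, 0.65))
print("50% vs 65% distinguishable:", gap_is_distinguishable(0.50, 0.65))
```

Under these assumptions a score near 65% on 198 questions carries a margin of roughly six to seven points, so a five-point gap between two models does not clear the bar while a fifteen-point gap does.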

How the dataset is built

The construction process is what gives the benchmark its teeth. The full GPQA set contains 448 questions, each authored by a domain expert and then answered independently by other experts and by skilled non-experts who were given generous time and full internet access. That two-sided validation is the filter: questions experts agree on but searchers still miss are, by construction, resistant to lookup, which is the "Google-proof" property in the name. Diamond distills the strictest slice of that filter into its 198 questions. The methodology matters more than the trivia of the numbers, because it points at what the benchmark actually certifies: performance on questions where retrieval fails and multi-step domain reasoning is the only path to the answer. It also explains the benchmark's fragility: expert authorship makes questions expensive, which keeps the set small, which is why scores carry error bars worth respecting. Treat single-digit score gaps between models as ties unless they replicate across other reasoning evaluations.

Reading benchmarks like a sovereign operator

For someone choosing an open-weight model to run on their own hardware, GPQA earns a specific place in the toolkit: it indicates reasoning depth, which correlates with how well a model handles genuinely hard technical questions — the kind a miner might ask about power electronics or a node runner about protocol edge cases. But benchmark tables are marketing surfaces, and the practical questions remain local ones: does the model fit your VRAM after quantization, and does it hold up on your tasks? The verify-don't-trust instinct applies to AI claims as much as to anything else: run the candidate model on a private set of questions you know cold, and let published scores set expectations rather than settle them. GPQA complements the breadth-oriented MMLU benchmark and the step-by-step math of the GSM8K benchmark; together they give a fuller read on a model's reasoning depth than any one number.
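Both local checks mentioned above can be sketched in a few lines. Everything here is an assumption for illustration: `ask_model` stands in for whatever function sends a prompt to your locally hosted model and returns its reply, the question format is invented for this example, and the 20% `headroom` for the KV cache and activations is a rough guess that varies with context length and runtime.

```python
from typing import Callable, Sequence


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, headroom: float = 1.2) -> bool:
    """Rough fit check; `headroom` is an assumed allowance, not a measured value."""
    return weight_footprint_gb(params_billions, bits_per_weight) * headroom <= vram_gb


def score_private_set(questions: Sequence[dict],
                      ask_model: Callable[[str], str]) -> float:
    """Fraction of multiple-choice questions where the reply starts with the right letter."""
    if not questions:
        raise ValueError("need at least one question")
    correct = 0
    for q in questions:
        letters = "ABCD"[: len(q["options"])]
        lines = [q["prompt"]] + [f"{l}. {o}" for l, o in zip(letters, q["options"])]
        lines.append("Answer with a single letter.")
        reply = ask_model("\n".join(lines)).strip().upper()
        if reply[:1] == q["answer"]:
            correct += 1
    return correct / len(questions)


def always_a(prompt: str) -> str:
    """Stub model for the demo: answers "A" to everything."""
    return "A"


# Placeholder questions; replace with ones you know cold and keep them private.
private_questions = [
    {"prompt": "Which unit measures electrical resistance?",
     "options": ["Ohm", "Volt", "Watt", "Farad"], "answer": "A"},
    {"prompt": "How many bits are in one byte?",
     "options": ["4", "8", "16", "32"], "answer": "B"},
]

print("8B model at 4 bits fits in 8 GB:", fits_in_vram(8, 4, vram_gb=8))
print("stub accuracy:", score_private_set(private_questions, always_a))
```

Swapping `always_a` for a real call into your inference server turns this into a minimal private evaluation. Matching on the first letter of the reply is deliberately crude, so a model that explains before answering would need a stricter prompt or a more careful parser.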
