Definition
GPQA (Graduate-Level Google-Proof Q&A) is a benchmark introduced in 2023 by David Rein and colleagues to test deep reasoning in biology, physics, and chemistry. Its questions are written by domain experts holding or pursuing PhDs and are deliberately constructed to be "Google-proof": skilled non-experts with unrestricted internet access and more than thirty minutes per question reach only about 34% accuracy, while domain experts reach roughly 65%. This design targets the gap between recalling facts and genuinely reasoning through hard technical problems.
The Diamond subset
The most widely cited slice is GPQA Diamond, a 198-question subset restricted to questions that both expert validators answered correctly while a majority of non-expert validators answered incorrectly. Each question is multiple-choice with four options, giving a 25% random-guess baseline. Diamond has become a standard headline figure for frontier reasoning models because its difficulty resists the saturation that overtook earlier knowledge tests, and because its expert authorship makes simple web lookup ineffective.
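To make the 25% baseline concrete, here is a minimal sketch in Python of how a four-option multiple-choice score can be computed and rescaled against chance. The answer key and predictions are invented for illustration, and the names `accuracy` and `above_chance` are hypothetical helpers, not part of any official GPQA tooling.

```python
# Illustrative only: the answer key and predictions below are made up,
# and these helper names are not from any official GPQA evaluation code.
CHANCE = 1 / 4  # four options per question, so random guessing scores 25%


def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def above_chance(acc, chance=CHANCE):
    """Rescale accuracy so 0.0 means random guessing and 1.0 means perfect."""
    return (acc - chance) / (1 - chance)


answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "C", "D", "D", "A", "B", "A", "D"]

acc = accuracy(predictions, answers)
print(f"accuracy: {acc:.2%}, chance-adjusted: {above_chance(acc):.2%}")
```

Raw accuracy is the figure normally reported for Diamond. The chance-adjusted value is an optional reading aid that shows how much of a score lies above what guessing alone would produce.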
Why it matters
GPQA measures whether a model can handle genuinely expert-level scientific reasoning rather than memorized trivia, which makes it a sharper discriminator among top models than broad tests on which many score similarly. Its caveats are the small question count, the restriction to three sciences, and the ever-present risk that the questions leak into training data over time. As with any single benchmark, a GPQA number is most informative when compared across model versions and read alongside other evaluations.
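The small-question-count caveat can be quantified with a rough sketch. The Python below uses a normal-approximation (Wald) interval for a score on 198 questions. The count of 129 correct answers is a hypothetical value chosen to sit near 65%, and `accuracy_interval` is an illustrative helper, not an established API.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy score.

    Assumes questions are independent; z=1.96 gives roughly 95% coverage.
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical result: 129 of the 198 Diamond questions answered correctly.
p, low, high = accuracy_interval(129, 198)
print(f"accuracy {p:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions the half-width of the interval is a bit under seven percentage points, so differences of a few points between two models on Diamond may not be statistically meaningful on their own.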
GPQA complements the breadth-oriented MMLU benchmark and the step-by-step math of the GSM8K benchmark, together giving a sovereign operator a fuller read on a model's reasoning depth.
