198 "Google-proof" graduate-level science questions (the Diamond subset) in biology, physics, and chemistry. Designed so that even skilled non-experts with unlimited internet access cannot answer them reliably (34% non-expert vs. 65% PhD expert vs. 25% random baseline). Created to enable scalable oversight experiments, which study how humans can supervise AI that may surpass human expertise.

One of the most widely adopted science evals of the 2024-2025 era. Used in both the AA Intelligence Index v4.1 (6% weight) and the Epoch Capabilities Index. Top models now score 94.3%, exceeding PhD experts by 24pp. An Epoch AI analysis estimates the theoretical ceiling at ~95% because ~5-10% of questions are invalid, meaning the benchmark is effectively saturated for frontier differentiation. Published at COLM 2024 by Rein, Hou, Stickland, Petty, Pang, Dirani, Michael, and Bowman (NYU + Anthropic).
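
To make the saturation claim concrete, the remaining headroom can be expressed in questions rather than percentage points. The snippet below is a minimal sketch using only the figures quoted above (198 questions, a 94.3% top score, a ~95% estimated ceiling); the variable and function names are illustrative, and the ceiling is an approximation, not an exact bound.

```python
# Figures taken from the text above; names are illustrative only.
N_QUESTIONS = 198   # size of the Diamond subset
TOP_SCORE = 0.943   # reported top-model accuracy
CEILING = 0.95      # approximate ceiling attributed to Epoch AI


def to_questions(fraction: float, n: int = N_QUESTIONS) -> float:
    """Convert an accuracy fraction into an (average) number of questions."""
    return fraction * n


# Remaining headroom, in percentage points and in questions.
headroom_pp = (CEILING - TOP_SCORE) * 100
headroom_questions = to_questions(CEILING) - to_questions(TOP_SCORE)

print(f"Headroom: {headroom_pp:.1f}pp, about {headroom_questions:.1f} questions")
```

Under these assumptions the gap between the top score and the estimated ceiling is under one percentage point, which corresponds to only one or two questions out of 198.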

Paper

Venue: COLM 2024
Citations: 21

Evaluation Details

Questions: 198
Domains: 3
Scoring: multiple-choice accuracy (4 options per question, pass@1)
Human baseline: 65% (PhD experts in-domain; 74% discounting clear retrospective mistakes); 34% (skilled non-experts with unrestricted web access); 81.3% expert accuracy on the Diamond subset
Random baseline: 25% (4-way multiple choice)
Saturation: Near saturation. Top models score 93-94% on Diamond (AA, June 2026: Gemini 3.1 Pro Preview 94.1%, GPT-5.5 xhigh 93.5%), well above the 81.3% Diamond expert-validator accuracy. Epoch AI estimates that ~8% of questions may be invalid ('at least 90% of the benchmark is valid') but concludes that GPQA Diamond 'has a bit more juice left', with full saturation 'only a matter of time'.
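
The scoring described above (one answer per question, four options, exact match against the key) can be sketched as follows. This is an illustrative sketch, not the harness used by any particular leaderboard. The helper names are hypothetical, and the binomial standard error is a simplifying assumption that treats questions as independent.

```python
import math

RANDOM_BASELINE = 1 / 4  # 4-way multiple choice


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the single predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def standard_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate over n questions."""
    return math.sqrt(acc * (1 - acc) / n)


# Toy example with four questions (not real benchmark data).
toy_predictions = ["A", "C", "B", "D"]
toy_answers = ["A", "C", "D", "D"]
print(accuracy(toy_predictions, toy_answers))

# Sampling noise for a 94.1% score measured on 198 questions.
print(round(standard_error(0.941, 198), 4))
```

With only 198 questions, the standard error near 94% is roughly 1.7 percentage points under this assumption. The 0.6pp gap between the two quoted top scores is therefore well within sampling noise, which is one way to read the loss of frontier differentiation.
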
Used in: AA Intelligence Index v4.1, Epoch Capabilities Index
Domains: biology, physics, chemistry
Tags: benchmark, evaluation, science