198 "Google-Proof" graduate-level science questions (the Diamond subset) in biology, physics, and chemistry. Designed so that even skilled non-experts with unlimited internet access cannot answer them (34% non-expert vs. 65% PhD expert vs. 25% random baseline). Created to enable scalable oversight experiments — studying how humans can supervise AI that may surpass human expertise.

One of the most widely adopted science evals of the 2024-2025 era, used in both the AA Intelligence Index v4.0 (at 6.25% weight) and the Epoch Capabilities Index. Top models now score 94.3%, roughly 29 percentage points above the 65% PhD-expert baseline. An Epoch AI analysis puts the theoretical ceiling at ~95%, since an estimated 5-10% of questions are invalid, meaning the benchmark is effectively saturated for frontier differentiation. Published at COLM 2024 by Rein, Hou, Stickland, Petty, Pang, Dirani, Michael, and Bowman (NYU + Anthropic).
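
One way to read the ceiling estimate (a back-of-the-envelope reconstruction, not Epoch AI's actual methodology): a model that answers every valid question correctly and can only guess at the 25% random rate on the invalid fraction q tops out at (1 - q) + 0.25q = 1 - 0.75q, so a ~95% ceiling corresponds to q of about 6-7%.

```python
# Back-of-the-envelope ceiling for an otherwise-perfect model that
# guesses (25%) on the invalid fraction q of questions.
for q in (0.05, 0.075, 0.10):
    ceiling = (1 - q) + 0.25 * q  # equivalently 1 - 0.75 * q
    print(f"invalid fraction {q:.1%} -> effective ceiling {ceiling:.1%}")
# Roughly 96%, 94%, and 93%: an invalid fraction near 6-7% lands on ~95%.
```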


Venue: COLM 2024
Tags: benchmark, evaluation, science