The most comprehensive graduate-level knowledge and reasoning benchmark, spanning 285 disciplines (13 major disciplines, 72 fields, 285 subfields) with 26,529 questions, roughly 134× the 198 questions across 3 domains in GPQA Diamond. It is the first to include long-tail disciplines such as agriculture, light industry, and service science alongside mainstream STEM. Questions average 9.67 answer options (vs. 4 in GPQA), which makes random guessing far less effective (~10% vs. 25% expected accuracy).
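The scale and chance-level figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function names and the per-question option counts in the last example are illustrative assumptions, not part of the benchmark's tooling. Note that 1 / (average option count) is only an approximation of the chance rate: when option counts vary across questions, the exact rate is the mean of 1/k, which is at least as large.

```python
from statistics import mean

# Figures quoted in the benchmark description.
SUPERGPQA_QUESTIONS = 26_529
GPQA_DIAMOND_QUESTIONS = 198
SUPERGPQA_AVG_OPTIONS = 9.67
GPQA_OPTIONS = 4


def scale_factor(new_size: int, old_size: int) -> float:
    """Ratio of question counts between two benchmarks."""
    return new_size / old_size


def chance_accuracy(num_options: float) -> float:
    """Approximate accuracy of uniform random guessing with num_options choices."""
    return 1.0 / num_options


def exact_chance_accuracy(option_counts: list[int]) -> float:
    """Exact chance rate when each question has its own number of options."""
    return mean(1.0 / k for k in option_counts)


print(f"scale-up: {scale_factor(SUPERGPQA_QUESTIONS, GPQA_DIAMOND_QUESTIONS):.1f}x")
print(f"chance (9.67 options): {chance_accuracy(SUPERGPQA_AVG_OPTIONS):.1%}")
print(f"chance (4 options):    {chance_accuracy(GPQA_OPTIONS):.1%}")

# Hypothetical option counts, only to show the two estimates can differ.
hypothetical_counts = [4, 10, 10]
print(chance_accuracy(mean(hypothetical_counts)), exact_chance_accuracy(hypothetical_counts))
```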

Uses a Human-LLM collaborative filtering mechanism that iteratively refines the question pool to eliminate trivial or ambiguous questions. The best-performing model, DeepSeek-R1, reaches 61.82% accuracy, indicating that substantial headroom remains. Developed by the Doubao (Seed) team at ByteDance in collaboration with the M-A-P open-source community.

Evaluation Details

Questions: 26,529
Tasks: 285
Domains: 13
Scoring: multiple-choice accuracy (avg. 9.67 answer options per question)
Domains: engineering, science, medicine, agronomy, economics, education, history, law, literature and arts, management, military science, philosophy, sociology
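
As a rough illustration of the scoring rule, multiple-choice accuracy is the fraction of questions where the predicted option letter matches the gold answer. The sketch below is a minimal, assumed implementation: the `(domain, prediction, gold)` record layout, the function names, and the sample records are hypothetical and do not reflect the benchmark's official evaluation harness.

```python
from collections import defaultdict


def is_correct(prediction: str, gold: str) -> bool:
    """Compare option letters, ignoring case and surrounding whitespace."""
    return prediction.strip().upper() == gold.strip().upper()


def accuracy(records: list[tuple[str, str, str]]) -> float:
    """Overall accuracy over (domain, prediction, gold) records."""
    if not records:
        raise ValueError("no records to score")
    return sum(is_correct(pred, gold) for _, pred, gold in records) / len(records)


def accuracy_by_domain(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Per-domain accuracy, useful for a breakdown across the 13 domains."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for domain, pred, gold in records:
        totals[domain] += 1
        hits[domain] += int(is_correct(pred, gold))
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Hypothetical sample records, for illustration only.
sample = [
    ("science", "A", "A"),
    ("science", "C", "B"),
    ("law", "J", "J"),
    ("medicine", "d", "D"),
]
print(accuracy(sample))
print(accuracy_by_domain(sample))
```

With up to ten options per question, a harness along these lines would also need to decide how to treat unparseable model outputs; counting them as incorrect is one common choice, though the source does not say which convention is used.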

benchmark, evaluation, open-source