SimpleQA | Lab Index

4,326 short-form factuality questions adversarially collected against GPT-4 to measure factual recall and calibration. Questions are designed to have single, unambiguous, verifiable answers. Tests whether models know what they know — rewarding calibrated abstention over confident hallucination.

Used in the Epoch Capabilities Index (as SimpleQA Verified). Complements Artificial Analysis' AA-Omniscience (which tests similar capabilities at larger scale with 6,000 questions). Still discriminative for frontier models. By OpenAI.

Paper (arXiv) | Epoch ECI Leaderboard

Paper

Citations 12

arXiv HTML

Evaluation Details

Questions 4,326

Domains 10

Scoring LLM-graded short answers: a prompted ChatGPT classifier grades each response as correct, incorrect, or not attempted; headline metrics are overall % correct and an F-score (the harmonic mean of overall correct and correct-given-attempted)

Human baseline 94.4% (accuracy of a third human AI trainer on a 1,000-question quality-check sample, as scored by the ChatGPT grader; the authors estimate a ~3% inherent dataset error rate)

Domains: science & technology, politics, art, geography, sports, music, TV shows, history, video games, other
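
The scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's official scoring code: the function name `simpleqa_metrics` and its arguments are hypothetical, and it assumes the grader has already assigned each response one of the three labels (correct, incorrect, not attempted).

```python
def simpleqa_metrics(n_correct: int, n_incorrect: int, n_not_attempted: int) -> dict:
    """Compute SimpleQA-style headline metrics from graded label counts.

    Hypothetical helper for illustration; assumes each response was already
    graded as correct, incorrect, or not attempted.
    """
    total = n_correct + n_incorrect + n_not_attempted
    attempted = n_correct + n_incorrect

    # Overall correct: share of all questions answered correctly.
    overall_correct = n_correct / total if total else 0.0

    # Correct given attempted: share of attempted questions answered correctly.
    correct_given_attempted = n_correct / attempted if attempted else 0.0

    # F-score: harmonic mean of the two quantities above.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Example: 40 correct, 10 incorrect, 50 not attempted.
metrics = simpleqa_metrics(40, 10, 50)
print(metrics)
```

In this example the model answers only half the questions but gets 80% of its attempts right, so the F-score (about 0.53) sits above the overall correct rate (0.40). Abstaining removes a question from the attempted pool rather than counting it as a wrong attempt, although it still lowers overall correct.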

Tags: benchmark, evaluation, factuality
