4,326 short-form factuality questions adversarially collected against GPT-4 to measure factual recall and calibration. Questions are designed to have single, unambiguous, verifiable answers. Tests whether models know what they know, rewarding calibrated abstention over confident hallucination.

Used in the Epoch Capabilities Index (as SimpleQA Verified). Complements Artificial Analysis' AA-Omniscience (which tests similar capabilities at larger scale with 6,000 questions). Still discriminative for frontier models. By OpenAI.

Citations: 12

Evaluation Details

Questions: 4,326
Domains: 10 (science & technology, politics, art, geography, sports, music, TV shows, history, video games, other)
Scoring: LLM-graded short answers. A prompted ChatGPT classifier grades each response as correct, incorrect, or not attempted; headline metrics are overall % correct and F-score (harmonic mean of overall correct and correct-given-attempted).
Human baseline: 94.4% (a third human AI trainer on a 1,000-question quality-check sample, as graded by the ChatGPT grader; the authors estimate an inherent dataset error rate of about 3%)
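
The headline metrics described under Scoring can be sketched as below. This is a minimal illustration, not the benchmark's official scoring code: the function name `simpleqa_metrics` and the label strings `"correct"`, `"incorrect"`, and `"not_attempted"` are assumptions made for this example, and the grades are taken as already produced by the grader.

```python
from collections import Counter

# Hypothetical label names; the real grader's output format may differ.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def simpleqa_metrics(grades):
    """Compute overall correct, correct-given-attempted, and F-score.

    `grades` is a list of per-question labels from the grader.
    """
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")

    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    # Share of all questions answered correctly.
    overall_correct = correct / total
    # Share of attempted questions answered correctly (0 if nothing attempted).
    correct_given_attempted = correct / attempted if attempted else 0.0

    # F-score: harmonic mean of the two rates above.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Toy example: 6 correct, 2 incorrect, 2 not attempted.
example = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
print({name: round(value, 3) for name, value in simpleqa_metrics(example).items()})
```

In the toy example, abstaining on two questions leaves overall correct at 0.6 but raises correct-given-attempted to 0.75, so the F-score (about 0.667) sits between the two. This is how the metric gives some credit for declining to answer rather than guessing wrong.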
benchmark, evaluation, factuality