SimpleQA
Eval of 4,326 short-form factuality questions, adversarially collected against GPT-4, that measures factual recall and calibration. Each question is designed to have a single, unambiguous, verifiable answer. Tests whether models know what they know, rewarding calibrated abstention over confident hallucination.
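To illustrate how abstention can be rewarded over a confident wrong answer, here is a minimal sketch of a three-way scoring summary. It assumes each response has already been graded as correct, incorrect, or not attempted. The label strings, the `summarize` function, and the `wrong_penalty` parameter are hypothetical names chosen for this example, and the penalized score is an illustrative choice rather than the benchmark's official metric.

```python
from collections import Counter

# Hypothetical grade labels (assumption): a model may answer correctly,
# answer incorrectly, or decline to answer.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ATTEMPTED = "not_attempted"


def summarize(grades, wrong_penalty=1.0):
    """Summarize a list of per-question grades.

    wrong_penalty is an illustrative knob (assumption): each wrong answer
    costs this many points, each correct answer earns one point, and an
    abstention earns zero.
    """
    total = len(grades)
    if total == 0:
        raise ValueError("grades must not be empty")

    counts = Counter(grades)
    correct = counts[CORRECT]
    incorrect = counts[INCORRECT]
    attempted = correct + incorrect

    return {
        # Fraction of all questions answered correctly.
        "overall_correct": correct / total,
        # Accuracy restricted to the questions the model chose to answer.
        "correct_given_attempted": correct / attempted if attempted else 0.0,
        # Fraction of questions the model declined to answer.
        "not_attempted": counts[NOT_ATTEMPTED] / total,
        # Score that penalizes confident errors but not abstentions.
        "penalized_score": (correct - wrong_penalty * incorrect) / total,
    }


# A cautious model that abstains when unsure, and a model that always guesses.
cautious = [CORRECT] * 4 + [NOT_ATTEMPTED] * 6
guesser = [CORRECT] * 5 + [INCORRECT] * 5

cautious_summary = summarize(cautious)
guesser_summary = summarize(guesser)
```

Under this sketch the guessing model gets more questions right overall, but the cautious model has the higher penalized score and the higher accuracy on attempted questions, which is the kind of behavior a calibration-focused eval is meant to surface.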
Used in the Epoch Capabilities Index (as SimpleQA Verified). Complements Artificial Analysis' AA-Omniscience, which tests similar capabilities at a larger scale (6,000 questions). Still discriminates between frontier models. Created by OpenAI.