BIG-Bench Extra Hard (BBEH) is the successor to BIG-Bench Hard (BBH). It replaces BBH's 23 saturated tasks with much harder variants designed to challenge frontier models. Harmonic mean accuracy is 9.8% for the best general-purpose model and 44.8% for the best reasoning model, whereas frontier models score above 90% on BBH. Tasks span logical deduction, causal reasoning, spatial understanding, and multi-step inference.

Published at ACL 2025 by Google DeepMind and rapidly adopted as the BBH successor in evaluation suites. The original BIG-Bench (2022; 204 tasks, 444 authors from 132 institutions) was one of the most influential multi-task evaluation efforts; BBEH preserves its spirit while restoring discriminative power.

Paper

Venue: ACL 2025

Evaluation Details

Questions: 4,520
Tasks: 23
Domains: 11
Scoring: exact-match accuracy on the extracted final answer (the text following the "The answer is:" prefix), aggregated across the 23 tasks by adjusted harmonic mean (micro average for BBEH Mini)
Random baseline: 2.4% on the primary metric (adjusted harmonic mean over per-task random baselines, paper Table 2); 8.4% as a micro average over all examples (Table 4). Per-task random baselines range from 0% to 38% because tasks mix multiple-choice and free-form answers.
Domains: temporal reasoning, spatial and geometric reasoning, commonsense, humor understanding, causal reasoning, world entities and events, deductive logic, linguistic reasoning, counting and filtering, data structures and algorithms, arithmetic
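
The scoring pipeline described above (answer extraction, exact match, then aggregation) can be sketched roughly as follows. This is an illustrative sketch, not the official BBEH scorer: the function names are hypothetical, the normalization (lowercasing, stripping a trailing period) is an assumption, and the "adjustment" is assumed to mean adding 1 percentage point to each per-task accuracy before taking the harmonic mean, so that a task at 0% does not force the aggregate to zero. Check the paper and the released evaluation code before relying on any of these details.

```python
from statistics import harmonic_mean
from typing import Optional, Sequence

ANSWER_PREFIX = "The answer is:"  # prefix named in the scoring description


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last answer prefix, or None if it is absent."""
    idx = response.rfind(ANSWER_PREFIX)
    if idx == -1:
        return None
    tail = response[idx + len(ANSWER_PREFIX):].strip()
    # Keep only the first line after the prefix (assumed normalization).
    first_line = tail.splitlines()[0].strip() if tail else ""
    # Drop a trailing period, e.g. "(B)." becomes "(B)" (assumed normalization).
    return first_line.rstrip(".").strip()


def is_correct(response: str, target: str) -> bool:
    """Case-insensitive exact match between extracted answer and target."""
    predicted = extract_final_answer(response)
    return predicted is not None and predicted.lower() == target.strip().lower()


def adjusted_harmonic_mean(per_task_accuracy_pct: Sequence[float],
                           offset: float = 1.0) -> float:
    """Harmonic mean of per-task accuracies (in percent) after adding `offset`.

    The offset of 1.0 is an assumption about how the adjustment is defined.
    """
    return harmonic_mean([acc + offset for acc in per_task_accuracy_pct])


def micro_average(correct_counts: Sequence[int],
                  total_counts: Sequence[int]) -> float:
    """Accuracy in percent over all examples pooled across tasks."""
    return 100.0 * sum(correct_counts) / sum(total_counts)
```

Under these assumptions, a model that scores 0% on one task and 99% on another gets an adjusted harmonic mean of about 1.98, while the micro average over two equally sized tasks is 49.5%. The harmonic mean is dominated by the weakest tasks, which is consistent with the random baseline being much lower on the primary metric (2.4%) than as a micro average (8.4%).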
Tags: benchmark, evaluation, reasoning, foundational