Successor to BIG-Bench Hard (BBH), replacing its 23 saturated tasks with much harder variants designed to resist frontier-model capabilities. Harmonic mean accuracy: 9.8% for the best general model and 44.8% for the best reasoning model, versus BBH, where frontier models scored above 90%. Tasks span logical deduction, causal reasoning, spatial understanding, and multi-step inference.
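The headline numbers are harmonic means over per-task accuracies. Unlike the arithmetic mean, the harmonic mean is dominated by the lowest-scoring tasks, so a model cannot post a high aggregate by acing a few tasks while failing others. A minimal sketch with made-up per-task accuracies (not real BBEH numbers):

```python
from statistics import harmonic_mean

# Hypothetical per-task accuracies for one model (illustrative only).
task_accuracies = [0.92, 0.85, 0.05, 0.60]

# Arithmetic mean hides the near-zero task; harmonic mean is dragged toward it.
arith = sum(task_accuracies) / len(task_accuracies)
harm = harmonic_mean(task_accuracies)

print(f"arithmetic mean: {arith:.3f}")  # 0.605
print(f"harmonic mean:   {harm:.3f}")   # 0.167
```

The gap between the two aggregates (0.605 vs. 0.167 here) is exactly the discriminative pressure BBEH aims for: uniform competence across all tasks is required to score well.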

Rapidly adopted as the BBH successor in evaluation suites. Published at ACL 2025 by Google DeepMind. The original BIG-Bench (2022; 204 tasks, 444 authors from 132 institutions) was one of the most influential multi-task evaluation efforts; BBEH preserves its spirit while restoring discriminative power.


Venue: ACL 2025
Tags: benchmark, evaluation, reasoning, foundational