HumanEval
164 hand-written Python programming problems, each consisting of a function signature, docstring, and unit tests. Each problem tests language comprehension, algorithms, and simple mathematics. Introduced the pass@k metric (the probability that at least one of k generated samples passes the unit tests), which became the standard for code generation evaluation. Released alongside the Codex model paper.
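A minimal sketch of the format and the metric: the problem below is hypothetical (not taken from the dataset) but mirrors the signature/docstring/unit-test layout, and `pass_at_k` follows the unbiased estimator described in the Codex paper.

```python
import numpy as np

# Hypothetical HumanEval-style problem: signature + docstring, graded by hidden unit tests.
def add_squares(nums):
    """Return the sum of the squares of the integers in nums."""
    return sum(x * x for x in nums)

def check(candidate):
    # Unit tests the model's completion must satisfy.
    assert candidate([1, 2, 3]) == 14
    assert candidate([]) == 0

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, of which c are correct:
    1 - C(n-c, k) / C(n, k), computed in a numerically stable form."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 generations for one problem, 37 passing.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```

In practice the per-problem estimates are averaged over the 164 problems to report a single pass@k score.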
The foundational code generation benchmark: virtually every LLM and code model since 2021 reports HumanEval scores. Now saturated (frontier models score above 95%) and largely superseded for frontier differentiation by harder benchmarks (SWE-Bench, LiveCodeBench, Terminal-Bench), but it remains the reference point for the field. By Chen, Tworek, et al. (OpenAI).