HumanEval
164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. Each problem tests language comprehension, algorithms, and simple mathematics. Popularized the pass@k metric (the probability that at least one of k generated samples passes all unit tests), which became the standard for code generation evaluation. Released alongside the Codex model paper.
The foundational code generation benchmark — virtually every LLM and code model since 2021 reports HumanEval scores. Now saturated (frontier models >95%) and largely superseded by harder benchmarks (SWE-Bench, LiveCodeBench, Terminal-Bench) for frontier differentiation, but remains the reference point for the field. By Chen, Tworek et al. (OpenAI).
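To make the problem format concrete, here is a minimal sketch of a HumanEval-style task and a functional-correctness check. The problem itself (`running_max`) is hypothetical and not taken from the dataset, and the `passes` helper is an illustrative assumption; a real harness would run completions in a sandbox with timeouts rather than calling `exec` directly.

```python
# Hypothetical problem written in the HumanEval style (NOT from the dataset).
# The model sees PROMPT (signature + docstring) and must produce the body.
PROMPT = '''def running_max(xs: list) -> list:
    """Return a list whose element i is the maximum of xs[:i+1].
    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate completion, as a model might generate it.
COMPLETION = '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''

# Unit tests wrapped in a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every unit test.

    Illustrative only: exec on untrusted model output is unsafe outside
    a sandbox, and this sketch has no timeout.
    """
    namespace = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False
```

A sample counts as correct only if every assertion holds, so a completion that gets most cases right still scores zero for that sample.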
Paper
Evaluation Details
Questions 164
Domains 5
Scoring pass@k functional correctness via unit tests (avg 7.7 tests per problem)
Saturation Saturated: OpenAI simple-evals reports 99.3% (o4-mini-high), 94.5% (GPT-4.1), 90.2% (GPT-4o); 92.0% for Claude 3.5 Sonnet (listed in simple-evals under 'Other Models (Reported)', sourced from Anthropic's announcement)
Domains: Python programming, language comprehension, reasoning, algorithms, simple mathematics
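The pass@k scoring listed above is usually computed with the unbiased estimator described in the Codex paper: generate n ≥ k samples per problem, count the c samples that pass all unit tests, and estimate pass@k as 1 − C(n−c, k) / C(n, k), averaged over problems. The sketch below is a minimal version of that calculation; the function names and the `results` input layout are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Mean pass@k over problems; results is a list of (n, c) pairs
    (a hypothetical layout chosen for this sketch)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with n = 10 samples of which c = 3 pass, pass@1 is 1 − 7/10 = 0.3, which matches the plain fraction of passing samples.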