Definition
HumanEval is a code-generation benchmark released by OpenAI in 2021 alongside the Codex paper. It contains 164 hand-written Python programming problems, each consisting of a function signature, a natural-language docstring describing the task, and a set of unit tests that are withheld from the model's prompt. Rather than comparing the model's text to a reference answer, HumanEval executes the generated code and checks whether it passes every test, making it a measure of functional correctness rather than surface similarity.
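To make the execution-based check concrete, here is a minimal sketch of how a single HumanEval-style problem could be scored. The toy problem, the `passes` helper, and the sample completions are all hypothetical illustrations, not entries from the dataset or code from the official harness. The sketch assumes the tests are packaged as a `check(candidate)` function, and it runs code with a bare `exec`, which a real evaluation should replace with a sandbox.

```python
# Hypothetical toy problem in the HumanEval style (not taken from the dataset).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Tests are kept out of the prompt; the model only ever sees PROMPT.
TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True only if prompt + completion runs and every test assertion holds.

    WARNING: exec() on model output is unsafe; this is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(tests, namespace)                    # define check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes(PROMPT, good_completion, TESTS, "add"))
print(passes(PROMPT, bad_completion, TESTS, "add"))
```

The first completion satisfies both assertions and the second fails the first one, so only the first counts as a pass, regardless of how similar the two look as text.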
The pass@k metric
HumanEval popularized the pass@k metric. The model generates k independent solutions for each problem, and a problem counts as solved if at least one sample passes all its tests; the reported score is the fraction of problems solved. Commonly reported values are pass@1 (a single attempt), pass@10, and pass@100. Pass@1 reflects how reliably a model writes correct code on the first try, while higher k values reveal whether a correct solution exists among many samples. Each problem ships with an average of roughly 7.7 unit tests, so passing is non-trivial.
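Sampling exactly k solutions per problem gives a noisy score, so the Codex paper instead draws n ≥ k samples, counts the c that pass, and estimates pass@k as 1 − C(n−c, k) / C(n, k), averaged over problems. The sketch below computes that estimate for one problem; the function name `pass_at_k` and the example counts are illustrative assumptions, not results from any model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset includes a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples drawn for a problem, 3 of them passed.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=10, c=3, k=k):.3f}")
```

A benchmark-level score would average this value across all problems. With k = 1 the estimate reduces to c / n, the plain fraction of samples that passed.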
Strengths and limits
Because it runs real tests, HumanEval rewards genuinely working code instead of plausible-looking text, which makes it far harder to game than overlap-based metrics. Its limits are equally clear: 164 short, self-contained Python functions do not represent large codebases, multi-file projects, or other languages, and the problems are now widely present in training data, raising contamination concerns. Related suites address some of these gaps: MBPP, a separate benchmark of basic Python problems, broadens coverage, and HumanEval+ adds stricter tests to the original problems.
HumanEval is a practical signal when choosing a self-hosted coding model for a sovereign workstation. Read it next to general-knowledge tests like the MMLU benchmark and human-preference rankings such as Chatbot Arena Elo for a rounded picture.
