HumanEval and Code Generation Benchmarks

If you read a paper about a new LLM, you will inevitably see a “HumanEval” score. It is the lingua franca of coding ability. But what exactly is it testing?

What is HumanEval?

Released by OpenAI in 2021 (alongside the Codex model), HumanEval is a dataset of 164 hand-written Python coding problems.

The Format

Each problem consists of:

  1. A function signature.
  2. A docstring describing the task.
  3. Unit tests (hidden from the model).

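The harness concatenates the prompt (signature plus docstring) with the model's completion, appends the hidden unit tests, and executes the result. A simplified sketch of that loop, with the caveat that the real harness runs each program in a sandboxed subprocess with a timeout rather than a bare `exec`:

```python
def run_problem(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if the model's completion passes the hidden tests.

    Simplified sketch: the real HumanEval harness executes the program
    in an isolated subprocess with resource limits and a timeout.
    """
    program = prompt + completion + "\n" + test_code
    try:
        exec(program, {"__name__": "__main__"})  # run signature + body + tests
        return True
    except Exception:  # any test failure or runtime error counts as a fail
        return False
```

A correct completion makes the assertions pass; a buggy one raises and is scored as a failure.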
Example (reproduced verbatim from the dataset, spelling mistakes included):

def fib4(n: int):
    """
    The Fib4 number sequence is a sequence similar to the Fibbonacci sequnece that's defined as follows:
    fib4(0) -> 0
    fib4(1) -> 0
    fib4(2) -> 2
    fib4(3) -> 0
    fib4(n) -> fib4(n-1) + fib4(n-2) + fib4(n-3) + fib4(n-4).
    Please write a function to efficiently compute the n-th element of the fib4 number sequence.  Do not use recursion.
    >>> fib4(5)
    4
    >>> fib4(6)
    8
    """

The model reads the prompt and generates the function body.
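For the prompt above, a completion that passes the hidden tests might look like the following (one valid solution among many; the model only has to produce the body beneath the signature):

```python
def fib4(n: int) -> int:
    # Iterative dynamic-programming solution, as the docstring demands:
    # keep a sliding window of the last four values instead of recursing.
    window = [0, 0, 2, 0]  # fib4(0) .. fib4(3), the base cases
    if n < 4:
        return window[n]
    for _ in range(4, n + 1):
        window = window[1:] + [sum(window)]  # advance the window by one
    return window[-1]
```

This runs in O(n) time and O(1) space, satisfying the "efficiently, no recursion" requirement in the docstring.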

Pass@k

You’ll often see scores like Pass@1 or Pass@10.

  • Pass@1: The model generates code once. If it passes the tests, it counts as a success.
  • Pass@10: The model generates 10 different solutions. If any one of them passes, it counts.
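In practice, papers do not literally resample until something passes. The Codex paper introduced an unbiased estimator: generate n ≥ k samples per problem, count how many (c) pass, and compute the probability that a random draw of k samples contains at least one pass. A sketch of that formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: how many of those samples passed the hidden tests
    k: the sampling budget being scored
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so every k-draw contains a pass
    # P(at least one pass) = 1 - P(all k draws fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem estimates are then averaged over all 164 problems to get the headline score.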

Pass@1 is the gold standard for “how good is this model out of the box?”

MBPP (Mostly Basic Python Problems)

Often cited alongside HumanEval is MBPP. It's very similar but contains 974 crowdsourced, entry-level Python problems (papers typically evaluate on a 500-problem test split).

  • Difficulty: Generally easier/simpler than HumanEval.
  • Use: Confirms basic syntactic competence.

The Score Inflation

When HumanEval was released:

  • GPT-3: ~0% (couldn’t code)
  • Codex (Initial): ~28%
  • GPT-4: ~67% (early versions)
  • GPT-4o / Claude 3.5: ~92%+

We have effectively “solved” HumanEval. When models reach 90%+, the benchmark stops being useful for distinguishing the very best. It becomes a hygiene check: “Is this model broken?” rather than “Is this model genius?”

Why It’s “Showing Its Age”

  1. Isolation: Real coding isn’t writing a single isolated function. It involves imports, classes, dependencies, and external libraries.
  2. LeetCode Style: It tests algorithmic knowledge (dynamic programming, string manipulation), not software engineering (API design, refactoring).
  3. Contamination: Every model trained on GitHub likely has seen these exact problems or variations of them.

When to Look at HumanEval

Despite its flaws, HumanEval is still the best quick-glance metric for Small Language Models (SLMs).

If you are looking at a 7B parameter model (like Llama 3 8B or Mistral 7B):

  • Score > 60%: Excellent. Useful for local coding assistance.
  • Score < 30%: Avoid for coding tasks.

For massive frontier models (GPT-5 class), ignore HumanEval and look at SWE-Bench.

Conclusion

HumanEval was a critical milestone. It taught us that LLMs could write code. But as AI evolves from “Code Completion” to “Software Engineering,” we are moving toward harder, repo-level benchmarks.


Next: Chatbot Arena — The only benchmark that truly captures “vibe.”