SWE-Bench: Measuring Coding Ability

For years, we tested AI coding ability with HumanEval—simple, isolated functions like “write a function to reverse a string.”

But real software engineering isn’t writing reverse_string(). It’s navigating a massive codebase, understanding dependencies, fixing a bug in utils.py that breaks api.py, and writing a test to prove it works.
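To make the contrast concrete, here is what a HumanEval-style task looks like: one self-contained function graded by a handful of unit checks. (This is an illustrative example in the spirit of the benchmark, not an actual HumanEval item.)

```python
# A HumanEval-style problem: one isolated function, zero codebase context.
def reverse_string(s: str) -> str:
    """Return the input string reversed."""
    return s[::-1]

# Grading is equally simple: run a few unit checks against the function.
assert reverse_string("hello") == "olleh"
assert reverse_string("") == ""
```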

Enter SWE-Bench.

What is SWE-Bench?

SWE-Bench (Software Engineering Benchmark) is a dataset of real-world GitHub issues drawn from popular Python repositories like scikit-learn, flask, django, and requests.

It asks the AI to:

  1. Read a GitHub issue description (bug report or feature request).
  2. Explore an entire codebase (thousands of files).
  3. Write a patch (code changes) to fix the issue.
  4. Pass the held-out tests associated with that issue (tests that expose the bug, which the model never sees during its attempt).

This is exactly what a human junior developer does.
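The steps above map onto the structure of a single benchmark instance. A rough sketch of its shape (the field names FAIL_TO_PASS and PASS_TO_PASS come from the public dataset; the concrete values here are illustrative placeholders, and real instances carry more fields):

```python
# Simplified sketch of one SWE-Bench task instance.
task = {
    "repo": "django/django",                    # repository the agent works in
    "instance_id": "django__django-11099",      # unique task identifier
    "base_commit": "abc123",                    # starting commit (placeholder hash)
    "problem_statement": "Bug report text from the GitHub issue ...",
    "FAIL_TO_PASS": ["test_that_exposes_bug"],  # must go from failing to passing
    "PASS_TO_PASS": ["test_existing_behavior"], # must keep passing (no regressions)
}

# The model only sees the issue text plus repo access; the test lists
# are held out and used purely for grading.
model_input = {"issue": task["problem_statement"], "repo": task["repo"]}
```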

Why HumanEval Wasn’t Enough

HumanEval / MBPP:

  • Task: “Write one function.”
  • Context: Zero (self-contained).
  • Difficulty: LeetCode Easy.
  • Result: Models like GPT-4 hit 90%+, but still couldn’t build apps reliably.

SWE-Bench:

  • Task: “Fix this obscure bug in Django’s ORM.”
  • Context: 100,000+ lines of code.
  • Difficulty: Real-world messiness.
  • Result: Even GPT-4 initially solved only ~1.7% of tasks (SWE-Bench Verified has since seen higher scores, but it’s hard).

How It Works

A typical SWE-Bench task looks like this:

Repo: matplotlib/matplotlib
Issue: “Colorbar tick labels overlap when using log scale”

Input to Model:

  • The issue text.
  • Access to the repo files.

The Agent’s Job:

  1. search_code("colorbar") -> finds relevant files.
  2. read_file("lib/matplotlib/colorbar.py")
  3. Reproduce the bug.
  4. Edit the file.
  5. Run tests.

Success Condition: The patch applies cleanly, the previously failing test case (the one that exposes the bug) now passes, and the existing test suite still passes.
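That success condition boils down to three boolean checks. A minimal sketch of the grading logic (the real harness applies the patch and runs each repo’s test suite inside Docker; here the test results are passed in as plain dicts for illustration):

```python
def grade_patch(patch_applied: bool,
                fail_to_pass: dict,
                pass_to_pass: dict) -> bool:
    """Sketch of the SWE-Bench success condition.

    fail_to_pass / pass_to_pass map test names to pass/fail booleans.
    """
    if not patch_applied:
        return False                       # patch must apply cleanly
    if not all(fail_to_pass.values()):
        return False                       # bug-exposing tests must now pass
    return all(pass_to_pass.values())      # existing tests must not regress

# A patch that fixes the bug without breaking anything counts as resolved:
print(grade_patch(True,
                  {"test_log_scale_ticks": True},
                  {"test_linear_ticks": True}))   # -> True
```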

The “Verified” vs “Lite” Versions

Because the full SWE-Bench is incredibly hard and slow to run (requires a Docker container for every repo):

  • SWE-Bench Verified: A human-validated subset of 500 issues that are confirmed to be solvable and deterministic. This is the current standard.
  • SWE-Bench Lite: A smaller subset (300 issues) for faster iteration.

Current Leaderboard (Approximate - Jan 2026)

Note: These numbers change weekly.

  1. Devin / Factory / Specialized Agents: ~40-50%
    • Agents that use tools, loops, and terminal access score much higher than raw models.
  2. Claude 3.5 Sonnet (Agentic): ~35-40%
  3. GPT-4o (Agentic): ~30-35%
  4. Open Source (Llama 3 + Agent scaffold): ~20-25%

Implications for Developers

  1. Agent Frameworks Matter: You can’t just prompt “Fix this.” You need an architecture (like Devin or open-source equivalents like OpenDevin / SWE-agent) that allows the model to browse, edit, and test in a loop.
  2. Context Window is King: To fix a bug in a huge repo, the model needs to “see” a lot of code. Models with 128k+ or 1M+ context windows perform significantly better here.
  3. We Are Not There Yet: A 40% success rate means the AI fails more often than it succeeds on complex tickets. It’s a “Junior Dev” that needs supervision, not a “Senior Architect.”
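The browse-edit-test loop from point 1 can be sketched in a few lines. The tool names and the `next_action` interface are hypothetical; real scaffolds like SWE-agent define their own tool interfaces, but the control flow is the same idea:

```python
def run_agent(model, tools, issue_text, max_steps=20):
    """Let the model drive tools until it submits a patch or runs out of budget.

    `model.next_action(history)` and the tool names are illustrative
    assumptions, not a real library API.
    """
    history = [f"Issue: {issue_text}"]
    for _ in range(max_steps):
        # The model picks the next action from the transcript so far.
        action, arg = model.next_action(history)
        if action == "submit":
            return arg                     # arg is the final patch
        # e.g. tools["search_code"], tools["read_file"], tools["run_tests"]
        observation = tools[action](arg)
        history.append(f"{action}({arg!r}) -> {observation}")
    return None                            # budget exhausted: task failed
```

The key design choice is the loop itself: the model observes the result of each tool call before picking the next one, instead of emitting a patch in a single shot.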

Conclusion

SWE-Bench is the reality check the AI industry needed. It proves that solving coding puzzles != building software. If you want to know if a model can actually help you work, look at its SWE-Bench score, not its HumanEval score.


Next: MMLU & GPQA — Testing general knowledge and expert reasoning.