MMLU and GPQA: Testing Knowledge and Reasoning

When a new AI model drops, the first number everyone looks at is its MMLU score. But a harder test called GPQA has emerged to separate the frontier models from the rest of the pack.

Let’s break down these acronyms and why they matter for understanding AI intelligence.

MMLU: The SAT for AI

MMLU (Massive Multitask Language Understanding) is the industry standard for measuring “general world knowledge.”

What is it?

  • Format: Multiple-choice questions (A, B, C, D).
  • Scope: 57 subjects across STEM, humanities, and social sciences.
  • Topics: Everything from Elementary Math and US History to Law, Medicine, and High-Energy Physics.

Example Question (Microeconomics):

If the price of a substitute good increases, the demand curve for the original good will:

  A) Shift left
  B) Shift right
  C) Stay the same
  D) Invert

Correct Answer: B
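To make the format concrete, here's a minimal sketch of how an MMLU-style evaluation loop works. The `ask_model` function is a hypothetical stand-in for a real model API call, not part of any benchmark library:

```python
# Minimal sketch of an MMLU-style evaluation loop.

questions = [
    {
        "question": "If the price of a substitute good increases, "
                    "the demand curve for the original good will:",
        "choices": {"A": "Shift left", "B": "Shift right",
                    "C": "Stay the same", "D": "Invert"},
        "answer": "B",
    },
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return "B"

def score(dataset) -> float:
    """Format each question, ask the model, and compute accuracy."""
    correct = 0
    for item in dataset:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}) {text}" for letter, text in item["choices"].items()
        ) + "\nAnswer with a single letter."
        if ask_model(prompt).strip().upper() == item["answer"]:
            correct += 1
    return correct / len(dataset)

print(score(questions))  # 1.0 with the stub model
```

Real harnesses add prompt templates, few-shot examples, and answer extraction, but the core loop is this simple: format, ask, compare letters, average.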

Why MMLU matters

It tests breadth. A high MMLU score means the model has “read the internet” and remembers facts well.

The Scores (Approximate)

  • Random Guessing: 25%
  • Average Human: ~35%
  • Expert Human: ~89%
  • GPT-3.5: ~70%
  • GPT-4 / Claude 3 Opus / Gemini 1.5: ~86-88%
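That 25% floor is simply what uniform guessing over four options earns in expectation, which a quick simulation confirms:

```python
import random

random.seed(0)  # make the simulation reproducible

# Guess uniformly among four choices; say the correct answer is "B" each time.
trials = 100_000
correct = sum(random.choice("ABCD") == "B" for _ in range(trials))
print(round(correct / trials, 2))  # ~0.25
```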

The problem: We are hitting the ceiling. Top models are now matching human experts, making it hard to distinguish between them.

GPQA: The PhD Exam

Enter GPQA (the Graduate-Level Google-Proof Q&A Benchmark).

What is it?

A dataset of extremely difficult questions written by PhDs in biology, physics, and chemistry.

The Catch

The questions are designed to be “Google-Proof”. Even if you have full internet access, you cannot easily find the answer unless you actually understand the underlying science.

Example (Conceptual Physics):

A complex scenario involving fluid dynamics, rotational inertia, and friction coefficients that requires a multi-step derivation to solve.

(I can’t even quote a real one easily because they are that dense.)

Why GPQA matters

It tests deep reasoning. It’s not about memorization; it’s about applying principles to novel, hard problems.

The Scores

  • PhD Experts (in their field): ~65-80%
  • Non-Expert Humans (with Google): ~34% (barely above the 25% random-guess floor)
  • GPT-4o: ~50-55%
  • Claude 3.5 Sonnet: ~55-60%
  • OpenAI o1 (Reasoning Model): ~75-80%

The “Reasoning” Gap

Notice the jump? Standard LLMs (like GPT-4) struggle on GPQA. But “Reasoning Models” (like OpenAI’s o1, codenamed “Strawberry”) that “think” before they answer crush this benchmark. This strongly suggests that extended chain-of-thought reasoning, not just more stored knowledge, is what unlocks expert-level problem solving.
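To illustrate the difference, a direct prompt asks for the answer outright, while a chain-of-thought prompt asks the model to derive it first. The toy physics question below is my own illustration, not an actual GPQA item:

```python
import math

# Toy physics question (illustrative only, not a real GPQA item).
question = ("A 2 kg block slides down a frictionless 30-degree incline. "
            "What is its acceleration?")

# Direct prompting: ask for the answer outright.
direct_prompt = question + "\nAnswer:"

# Chain-of-thought prompting: ask the model to reason before answering.
cot_prompt = question + "\nThink step by step, then state the final answer."

# The reasoning a chain-of-thought model is expected to produce:
# only the component of gravity along the incline accelerates the block,
# so a = g * sin(30 deg) = 9.8 * 0.5 = 4.9 m/s^2 (the mass cancels out).
a = 9.8 * math.sin(math.radians(30))
print(round(a, 2))  # 4.9
```

The prompt change is trivial; the behavioral change is not. Spelling out the intermediate steps is exactly what multi-step GPQA problems reward.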

Which One Should You Trust?

  • Use MMLU to see if a model is “smart generally.” If a model scores <60% on MMLU, it’s probably too dumb for complex business tasks.
  • Use GPQA to compare the absolute cutting edge. If Model A beats Model B on GPQA, it is likely better at complex scientific and multi-step logical reasoning.

Contamination Warning

A major issue with benchmarks is contamination. If the questions and answers are on the web, the model might have just memorized them during training.

  • MMLU is heavily contaminated (it’s been around for years).
  • GPQA is newer and harder to memorize, but still at risk.
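A crude way to screen for contamination is to check whether long word n-grams from a benchmark question appear verbatim in a sample of training text. Here's a minimal sketch (the corpus string is made up for the example; real checks scan terabytes and use hashing):

```python
# Rough sketch of an n-gram overlap contamination check.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word sequences in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, corpus: str, n: int = 8) -> bool:
    """Flag the question if any of its n-grams appear verbatim in the corpus."""
    return bool(ngrams(question, n) & ngrams(corpus, n))

corpus = ("... the demand curve for the original good will shift right ...")
q = ("If the price of a substitute good increases, "
     "the demand curve for the original good will")

print(looks_contaminated(q, corpus, n=6))  # True: a 6-gram matches verbatim
```

Verbatim overlap is only a lower bound: paraphrased or translated copies slip past n-gram checks entirely, which is one reason fresh private test sets remain valuable.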

Always look for “held-out” evaluations or live leaderboards (like Chatbot Arena, where prompts come from real users in real time and can’t be memorized in advance) to verify these numbers.


Next: HumanEval — The classic coding test.