MMLU and GPQA: Testing Knowledge and Reasoning
How do we measure if an AI is smart? MMLU tests breadth, GPQA tests depth. Understanding the two most important general benchmarks.
When a new AI model drops, the first number everyone looks at is its MMLU score. But recently, a harder test called GPQA has emerged to separate the genuinely expert models from the merely competent.
Let’s break down these acronyms and why they matter for understanding AI intelligence.
MMLU: The SAT for AI
MMLU (Massive Multitask Language Understanding) is the industry standard for measuring “general world knowledge.”
What is it?
- Format: Multiple-choice questions (A, B, C, D).
- Scope: 57 subjects across STEM, humanities, and social sciences.
- Topics: Everything from Elementary Math and US History to Law, Medicine, and High-Energy Physics.
Example Question (Microeconomics):
If the price of a substitute good increases, the demand curve for the original good will:
- A) Shift left
- B) Shift right
- C) Stay the same
- D) Invert
Correct Answer: B
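Mechanically, MMLU scoring is simple: ask the model each question, parse out a letter, count the matches. Here is a minimal sketch in Python, where `model_answer` is a hypothetical stand-in for whatever LLM call a real evaluation harness would make:

```python
def model_answer(question: str, choices: dict[str, str]) -> str:
    # Hypothetical: a real harness would prompt an LLM with the question
    # and choices, then parse the letter it picks. Hard-coded here so the
    # sketch runs on its own.
    return "B"

def score(questions: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        model_answer(q["question"], q["choices"]) == q["answer"]
        for q in questions
    )
    return correct / len(questions)

sample = [{
    "question": "If the price of a substitute good increases, "
                "the demand curve for the original good will:",
    "choices": {"A": "Shift left", "B": "Shift right",
                "C": "Stay the same", "D": "Invert"},
    "answer": "B",
}]
print(score(sample))  # 1.0 on this one-question toy set
```

The reported MMLU number is just this accuracy averaged over all 57 subjects (typically with a few worked examples, "few-shot", in the prompt).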
Why MMLU matters
It tests breadth. A high MMLU score means the model has “read the internet” and remembers facts well.
The Scores (Approximate)
- Random Guessing: 25%
- Average Human: ~35%
- Expert Human: ~89%
- GPT-3.5: ~70%
- GPT-4 / Claude 3 Opus / Gemini 1.5: ~86-88%
The problem: We are hitting the ceiling. Top models are now matching human experts, making it hard to distinguish between them.
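The ceiling has a statistical side, too. The sampling error on a benchmark score shrinks with the number of questions, so on paper a point of difference is meaningful, but in practice prompt formatting and answer parsing add variance well above that floor, which is why closely bunched scores are hard to rank. A back-of-the-envelope sketch (the ~14,000 figure assumes MMLU's approximate test-set size):

```python
import math

def standard_error(p: float, n: int) -> float:
    # Binomial standard error of an accuracy p measured on n questions.
    return math.sqrt(p * (1 - p) / n)

# A model scoring ~87% on ~14k questions:
print(round(standard_error(0.87, 14000), 4))  # 0.0028, i.e. ~0.3 points
```

So the pure statistical noise is only a few tenths of a point; the real ambiguity near the ceiling comes from evaluation setup, not sample size.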
GPQA: The PhD Exam
Enter GPQA (the Graduate-Level Google-Proof Q&A benchmark).
What is it?
A dataset of extremely difficult questions written by PhDs in biology, physics, and chemistry.
The Catch
The questions are designed to be “Google-Proof”. Even if you have full internet access, you cannot easily find the answer unless you actually understand the underlying science.
Example (Conceptual Physics):
A complex scenario involving fluid dynamics, rotational inertia, and friction coefficients that requires a multi-step derivation to solve.
(Real GPQA questions are dense enough that quoting one in full wouldn’t clarify much here.)
Why GPQA matters
It tests deep reasoning. It’s not about memorization; it’s about applying principles to novel, hard problems.
The Scores
- PhD Experts (in their field): ~65-80%
- Non-Expert Humans (with Google): ~34% (barely better than random)
- GPT-4o: ~50-55%
- Claude 3.5 Sonnet: ~55-60%
- OpenAI o1 (Reasoning Model): ~75-80%
The “Reasoning” Gap
Notice the jump? Standard LLMs struggle on GPQA. But “reasoning models” (like OpenAI’s o1, formerly codenamed “Strawberry”) that “think” before they answer do dramatically better. That is strong evidence that extended chain-of-thought reasoning matters for problems requiring deep expertise.
Which One Should You Trust?
- Use MMLU to see if a model is “smart generally.” If a model scores <60% on MMLU, it’s probably too dumb for complex business tasks.
- Use GPQA to compare the absolute cutting edge. If Model A beats Model B on GPQA, it is likely better at complex logic, coding, and scientific research.
Contamination Warning
A major issue with benchmarks is contamination. If the questions and answers are on the web, the model might have just memorized them during training.
- MMLU is heavily contaminated (it’s been around for years).
- GPQA is newer and harder to memorize, but still at risk.
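One common way labs screen for contamination is n-gram overlap: flag a benchmark question if a long enough word sequence from it appears verbatim in the training data. The sketch below illustrates the idea; the threshold, tokenization, and function names are my own assumptions, not any lab's exact procedure:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    # Whitespace tokenization is a simplification; real pipelines
    # typically use the model's own tokenizer.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str,
                       n: int = 8) -> bool:
    # Flag the item if any length-n word sequence from it appears
    # verbatim in the training document.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

print(looks_contaminated(
    "if the price of a substitute good increases",
    "recall that if the price of a substitute good increases demand shifts right",
))  # True: the full 8-word sequence appears verbatim
```

Exact-match checks like this miss paraphrased leakage, which is one reason contaminated scores can survive even a diligent filtering pass.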
Always look for “held-out” evaluations with private test sets, or live crowd-voted leaderboards like Chatbot Arena (where prompts are fresh and can’t be memorized), to verify these numbers.
Next: HumanEval — The classic coding test.