MMLU and GPQA: Testing Knowledge and Reasoning

When a new AI model drops, the first number everyone looks at is its MMLU score. But a harder test called GPQA has emerged to separate the frontier models from the rest of the pack.

Let’s break down these acronyms and why they matter for understanding AI intelligence.

MMLU: The SAT for AI

MMLU (Massive Multitask Language Understanding) is the industry standard for measuring “general world knowledge.”

What is it?

  • Format: Multiple-choice questions (A, B, C, D).
  • Scope: 57 subjects across STEM, humanities, and social sciences.
  • Topics: Everything from Elementary Math and US History to Law, Medicine, and High-Energy Physics.

Example Question (Microeconomics):

If the price of a substitute good increases, the demand curve for the original good will:

  A) Shift left
  B) Shift right
  C) Stay the same
  D) Invert

Correct Answer: B
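To make the format concrete, here's a minimal sketch of how an MMLU-style evaluation loop works. The `ask_model` function is a hypothetical stand-in for a real model API call, not part of any benchmark library:

```python
# Minimal sketch of an MMLU-style evaluation loop.

questions = [
    {
        "question": "If the price of a substitute good increases, "
                    "the demand curve for the original good will:",
        "choices": {"A": "Shift left", "B": "Shift right",
                    "C": "Stay the same", "D": "Invert"},
        "answer": "B",
    },
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return "B"

def score(dataset) -> float:
    """Format each question, ask the model, and compute accuracy."""
    correct = 0
    for item in dataset:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}) {text}" for letter, text in item["choices"].items()
        ) + "\nAnswer with a single letter."
        if ask_model(prompt).strip().upper() == item["answer"]:
            correct += 1
    return correct / len(dataset)

print(score(questions))  # 1.0 with the stub model
```

Real harnesses add prompt templates, few-shot examples, and answer extraction, but the core loop is this simple: format, ask, compare letters, average.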

Why MMLU matters

It tests breadth. A high MMLU score means the model has “read the internet” and remembers facts well.

The Scores (Approximate)

  • Random Guessing: 25%
  • Average Human: ~35%
  • Expert Human: ~89%
  • GPT-3.5: ~70%
  • GPT-4 / Claude 3 Opus / Gemini 1.5: ~86-88%
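That 25% floor is simply what uniform guessing over four options earns in expectation, which a quick simulation confirms:

```python
import random

random.seed(0)  # make the simulation reproducible

# Guess uniformly among four choices; say the correct answer is "B" each time.
trials = 100_000
correct = sum(random.choice("ABCD") == "B" for _ in range(trials))
print(round(correct / trials, 2))  # ~0.25
```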

The problem: We are hitting the ceiling. Top models are now matching human experts, making it hard to distinguish between them.

GPQA: The PhD Exam

Enter GPQA (the Graduate-Level Google-Proof Q&A Benchmark).

What is it?

A dataset of extremely difficult questions written by PhDs in biology, physics, and chemistry.

The Catch

The questions are designed to be “Google-Proof”. Even if you have full internet access, you cannot easily find the answer unless you actually understand the underlying science.

Example (Conceptual Physics):

A complex scenario involving fluid dynamics, rotational inertia, and friction coefficients that requires a multi-step derivation to solve.

(I can’t even quote a real one easily because they are that dense.)

Why GPQA matters

It tests deep reasoning. It’s not about memorization; it’s about applying principles to novel, hard problems.

The Scores

  • PhD Experts (in their field): ~65-80%
  • Non-Expert Humans (with Google): ~34% (barely above the 25% random-guess floor)
  • GPT-4o: ~50-55%
  • Claude 3.5 Sonnet: ~55-60%
  • OpenAI o1 (Reasoning Model): ~75-80%

The “Reasoning” Gap

Notice the jump? Standard LLMs (like GPT-4) struggle on GPQA. But “Reasoning Models” (like OpenAI’s o1, codenamed “Strawberry”) that “think” before they answer crush this benchmark. This strongly suggests that extended chain-of-thought reasoning, not just more stored knowledge, is what unlocks expert-level problem solving.
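To illustrate the difference, a direct prompt asks for the answer outright, while a chain-of-thought prompt asks the model to derive it first. The toy physics question below is my own illustration, not an actual GPQA item:

```python
import math

# Toy physics question (illustrative only, not a real GPQA item).
question = ("A 2 kg block slides down a frictionless 30-degree incline. "
            "What is its acceleration?")

# Direct prompting: ask for the answer outright.
direct_prompt = question + "\nAnswer:"

# Chain-of-thought prompting: ask the model to reason before answering.
cot_prompt = question + "\nThink step by step, then state the final answer."

# The reasoning a chain-of-thought model is expected to produce:
# only the component of gravity along the incline accelerates the block,
# so a = g * sin(30 deg) = 9.8 * 0.5 = 4.9 m/s^2 (the mass cancels out).
a = 9.8 * math.sin(math.radians(30))
print(round(a, 2))  # 4.9
```

The prompt change is trivial; the behavioral change is not. Spelling out the intermediate steps is exactly what multi-step GPQA problems reward.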

Which One Should You Trust?

  • Use MMLU to see if a model is “smart generally.” If a model scores <60% on MMLU, it’s probably too dumb for complex business tasks.
  • Use GPQA to compare the absolute cutting edge. If Model A beats Model B on GPQA, it is likely better at complex scientific and multi-step logical reasoning.

Contamination Warning

A major issue with benchmarks is contamination. If the questions and answers are on the web, the model might have just memorized them during training.

  • MMLU is heavily contaminated (it’s been around for years).
  • GPQA is newer and harder to memorize, but still at risk.
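A crude way to screen for contamination is to check whether long word n-grams from a benchmark question appear verbatim in a sample of training text. Here's a minimal sketch (the corpus string is made up for the example; real checks scan terabytes and use hashing):

```python
# Rough sketch of an n-gram overlap contamination check.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word sequences in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, corpus: str, n: int = 8) -> bool:
    """Flag the question if any of its n-grams appear verbatim in the corpus."""
    return bool(ngrams(question, n) & ngrams(corpus, n))

corpus = ("... the demand curve for the original good will shift right ...")
q = ("If the price of a substitute good increases, "
     "the demand curve for the original good will")

print(looks_contaminated(q, corpus, n=6))  # True: a 6-gram matches verbatim
```

Verbatim overlap is only a lower bound: paraphrased or translated copies slip past n-gram checks entirely, which is one reason fresh private test sets remain valuable.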

Always look for “held-out” evaluations or live leaderboards (like Chatbot Arena, where prompts come from real users in real time and can’t be memorized in advance) to verify these numbers.


Next: HumanEval — The classic coding test.