Scaling Laws: Why Bigger Models Are (Usually) Better

If you ask an AI researcher in 2025 why companies are spending $10 billion on a single training cluster, they will likely point to one thing: Scaling Laws.

These empirical observations have become the “Moore’s Law” of Artificial Intelligence, giving companies the confidence to invest massive capital with the expectation of predictable returns in capability.

The Kaplan Paper (2020)

In 2020, OpenAI researchers (including Jared Kaplan) published a seminal paper: “Scaling Laws for Neural Language Models.” They found a power-law relationship between three variables and the model’s performance (test loss):

  1. N: The number of parameters in the model.
  2. D: The size of the dataset (tokens).
  3. C: The amount of compute used for training.

The Power Law

The relationship is roughly: $$ L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C} $$ Where $L$ is the loss (the model’s prediction error) and each $\alpha$ is a small positive constant fitted from experiments.

In plain English: every time you multiply parameters, data, or compute by a constant factor, the loss falls by a predictable fraction. Plotted on log-log axes, the curve is a straight line. Within the ranges tested, it doesn’t plateau. It keeps getting better.
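To make the straight-line-on-log-log intuition concrete, here is a minimal sketch. The constant and exponent are illustrative round numbers, not the fitted values from the Kaplan paper:

```python
import math

# Illustrative power law: L(N) = c * N^(-alpha).
# alpha and c here are made-up round numbers for demonstration,
# not the constants fitted in "Scaling Laws for Neural Language Models".
ALPHA = 0.076
C = 10.0

def loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return C * n_params ** (-ALPHA)

# Each 10x increase in parameters shaves the same constant amount
# off log(loss) -- i.e., a straight line on a log-log plot.
for n in [1e6, 1e7, 1e8, 1e9, 1e10]:
    print(f"N = {n:8.0e}  loss = {loss(n):.3f}")
```

The key property is that the improvement per decade of scale is constant in log space, which is what makes extrapolation possible.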

Implications of Scaling

1. Predictability

This was a game-changer. Before this, training a frontier model was largely hit-or-miss. Now, researchers could train a series of small models (using 0.1% of the budget), fit the scaling curve, and predict with striking accuracy what loss a massive model would reach before spending $100 million to train it.
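The extrapolation trick above can be sketched in a few lines: fit a straight line to the small-model losses in log-log space, then read the line out at the target scale. The run data below is synthetic, generated to follow a clean power law for illustration:

```python
import math

# Hypothetical losses from four small training runs: (params, test loss).
# These numbers are made up to follow a clean power law for illustration.
runs = [(1e6, 3.50), (1e7, 2.94), (1e8, 2.46), (1e9, 2.07)]

# Fit log(loss) = log(c) - alpha * log(N) by least squares.
xs = [math.log(n) for n, _ in runs]
ys = [math.log(l) for _, l in runs]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
alpha = -slope
log_c = y_mean - slope * x_mean

# Extrapolate three orders of magnitude past the biggest run.
predicted = math.exp(log_c + slope * math.log(1e12))
print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at 1e12 params = {predicted:.2f}")
```

Because the law is linear in log space, a fit over cheap runs extends to scales that were never trained, which is exactly what makes the $100 million bet predictable.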

2. The “Bitter Lesson”

Rich Sutton’s “The Bitter Lesson” essay argued that clever, hand-engineered solutions (like hand-coded grammar rules) eventually lose to general methods that exploit raw computation. Scaling laws gave that argument hard empirical backing: the most reliable way to improve language models wasn’t cleverer algorithms, but more compute and data.

Emergent Abilities

While the loss (prediction error) improves smoothly with scale, measured capabilities on benchmarks often appear to jump suddenly at some threshold. This is called Emergence.

  • Small Model: Can’t do arithmetic.
  • Medium Model: Gets 10% right.
  • Large Model: Suddenly gets 90% right.
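One hedged intuition for how a smooth loss curve can produce a sudden jump (it is not the only proposed explanation): many benchmarks score all-or-nothing. If an answer is only counted correct when every one of its tokens is right, exact-match accuracy behaves like per-token accuracy raised to the answer length, which stays near zero for a long time and then rises steeply:

```python
# Sketch: per-token accuracy p improves smoothly with scale, but
# exact-match accuracy on a 10-token answer behaves like p**10,
# which looks like a sudden "emergent" jump on the benchmark.
ANSWER_LEN = 10

def exact_match(p: float) -> float:
    """Probability of getting all ANSWER_LEN tokens right,
    assuming independent per-token accuracy p."""
    return p ** ANSWER_LEN

for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-token p = {p:.2f}  exact-match = {exact_match(p):.3f}")
```

A model at 90% per-token accuracy still fails most 10-token answers, while one at 99% passes most of them, so a modest, smooth gain in the underlying metric registers as a phase transition on the benchmark.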

This “phase transition” behavior is why companies race to build larger models. They are hunting for the next emergent capability—whether it’s perfect reasoning, biological research, or autonomous coding.

Limits to Scaling?

Is it infinite? Not quite.

1. The Data Wall

Scaling laws assume you have infinite high-quality data. As we discussed in our Synthetic Data article, we are hitting the limits of human text. Scaling parameters without scaling data quality leads to overfitting.

2. Diminishing Returns

While loss continues to drop, the utility of that drop might decrease. Does predicting the next token 0.001% better actually result in a “smarter” answer, or just a more grammatically safe one?

3. Economic Constraints

Building a $100B cluster is possible. Building a $10T cluster requires nation-state resources. The laws of physics (power delivery, heat dissipation) eventually clash with the laws of scaling.

Compute-Optimal Scaling (Chinchilla)

In 2022, DeepMind refined these laws with the Chinchilla paper, which argued that most models (like GPT-3) were actually under-trained. They were too big and didn’t see enough data. We’ll cover this in depth in our next article.
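As a preview, the Chinchilla recipe can be sketched with two widely used approximations (rules of thumb, not exact results from the paper): training compute $C \approx 6ND$ FLOPs, and a compute-optimal budget of roughly 20 training tokens per parameter:

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb (approximate)
FLOPS_PER_PARAM_TOKEN = 6  # common estimate: C ~= 6 * N * D

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Split a compute budget into parameters N and tokens D,
    using D = 20 * N and C = 6 * N * D (both approximations)."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n = math.sqrt(c_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# Example: 3e23 FLOPs, a rough public estimate of GPT-3's training compute.
n, d = compute_optimal(3e23)
print(f"params ~= {n:.1e}, tokens ~= {d:.1e}")
```

Under these assumptions, GPT-3’s compute budget would have been better spent on a model with far fewer than 175B parameters trained on far more than 300B tokens, which is the core of the Chinchilla argument.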

Summary

Scaling laws are the compass guiding the AI industry. They promise that if we build it bigger, it will get smarter. Until this law breaks, the race for larger GPUs and bigger data centers will not stop.