Synthetic Data: Training AI on AI-Generated Content

The internet is running out of data. Or rather, we are running out of the high-quality, human-generated text required to train the next generation of frontier models. Some researchers estimate that stocks of high-quality language data could be exhausted as early as 2026. The solution? Synthetic data: data generated by AI, for AI.

The Data Scarcity Wall

Training GPT-4 reportedly drew on trillions of tokens of text, a large fraction of the usable public internet. To train a GPT-5 or GPT-6 that is 10x or 100x more capable, we need significantly more data. But we can't just 10x the internet.

Types of Data Shortages

  1. Code: We’ve scraped all of GitHub.
  2. High-Quality Text: Books, scientific papers, and high-quality journalism are finite.
  3. Reasoning Chains: Explicit step-by-step reasoning (Chain-of-Thought) is rare in natural human writing.

What is Synthetic Data?

Synthetic data is information that’s artificially manufactured rather than generated by real-world events. In the context of LLMs, it usually involves using a very capable model (like GPT-4) to generate training examples for a smaller model, or to generate data that covers “gaps” in human knowledge.

Examples of Synthetic Data Usage

  • Microsoft’s Phi Models: Phi-1, Phi-2, and Phi-3 were trained heavily on “textbook quality” synthetic data generated to teach reasoning and coding fundamentals. The result? Small models that punch way above their weight class.
  • Self-Correction: Generating a solution, checking it with a code interpreter or verifier, and then adding the correct path to the training set.
  • Meta’s Llama 3: Used synthetic data for alignment and fine-tuning, filtering out bad responses and amplifying good ones.
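The self-correction idea above can be sketched in a few lines. Everything here is a toy stand-in (the "model" just proposes two canned implementations, one deliberately buggy); a real pipeline would sample from an LLM and sandbox the execution:

```python
def propose_solutions():
    """Stand-in for an LLM sampling candidate solutions; here,
    two implementations of absolute value, one deliberately buggy."""
    return [
        lambda x: x if x >= 0 else -x,  # correct
        lambda x: x,                    # fails for negative inputs
    ]

def verified(candidate):
    """Verifier step: run the candidate against unit tests."""
    cases = [(3, 3), (-4, 4), (0, 0)]
    return all(candidate(x) == want for x, want in cases)

def correct_paths():
    """Keep only verified solutions for the training set."""
    return [c for c in propose_solutions() if verified(c)]

print(len(correct_paths()))  # the buggy candidate is filtered out
```

The key design point is that the verifier is cheap and objective (unit tests), so only correct reasoning paths ever enter the training set.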

The Model Collapse Risk

Critics argue that training AI on AI output leads to Model Collapse: a degenerative process in which models lose touch with the tails of the distribution (rare events and knowledge) and converge on bland, "average" nonsense.

“It’s like making a photocopy of a photocopy. Eventually, the image degrades into noise.”

However, recent research suggests collapse is primarily a risk of naive, recursive training on unfiltered synthetic data; careful curation and mixing in real data largely avoid it.

Curated Synthetic Data: The Smart Way

The industry has moved from “generate everything” to “curate and verify.”

The Recipe for Good Synthetic Data

  1. Teacher Model: Use a strong model (e.g., Claude 3 Opus) to generate samples.
  2. Constraint/Prompt: Ask for specific, rare scenarios (e.g., “Write a Python script using the pandas library to handle a specific edge case in time-series data”).
  3. Verification:
    • Code: Run the code. If it passes unit tests, keep it.
    • Math: Verify the steps logically.
    • Reasoning: Use a different model or heuristic to grade the output.
  4. Filtering: Discard the bottom 50% of generations. Only train on the top tier.
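Steps 1 through 4 can be sketched as a small pipeline. The teacher call below is a canned stub (a real system would query a strong model and sandbox the executed code; the function names are illustrative):

```python
def teacher_generate():
    """Hypothetical teacher model emitting (code, unit test) pairs.
    Here we return canned samples, one of which is buggy."""
    return [
        {"code": "def double(x):\n    return 2 * x",
         "tests": "assert double(3) == 6"},
        {"code": "def double(x):\n    return x + 1",  # buggy
         "tests": "assert double(3) == 6"},
    ]

def passes_tests(sample):
    """Step 3 (code): run the generation against its unit tests."""
    namespace = {}
    try:
        exec(sample["code"], namespace)
        exec(sample["tests"], namespace)
        return True
    except Exception:
        return False

def build_training_set(samples):
    """Step 4: keep only verified generations for training."""
    return [s for s in samples if passes_tests(s)]

kept = build_training_set(teacher_generate())
print(len(kept))  # only the correct `double` survives
```

For math and open-ended reasoning, `passes_tests` would be replaced by a symbolic checker or a grader model, but the generate, verify, filter skeleton stays the same.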

Techniques in Practice

1. Evol-Instruct

A method where simple human instructions are iteratively rewritten by an AI to be more complex, adding constraints, rare requirements, or reasoning steps.

  • Input: “Write a calculator.”
  • Evolution 1: “Write a calculator in Python.”
  • Evolution 2: “Write a calculator in Python that handles complex numbers and logs errors to a file.”
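A minimal sketch of that evolution loop, with a toy `ask_llm` standing in for the teacher model (the function name and the evolution directives here are illustrative, not from any specific implementation):

```python
import random

# Directives the evolver can mix in; a real Evol-Instruct setup
# has a richer pool (deepening, constraints, rare requirements).
EVOLUTION_DIRECTIVES = [
    "Add a constraint that the solution must handle invalid input.",
    "Require errors to be logged to a file.",
    "Ask for support of complex numbers.",
]

def evolve(instruction, ask_llm, depth=2):
    """Iteratively rewrite an instruction to be more complex.
    `ask_llm` stands in for a call to a teacher model."""
    for _ in range(depth):
        directive = random.choice(EVOLUTION_DIRECTIVES)
        instruction = ask_llm(
            f"Rewrite this instruction so it stays answerable "
            f"but becomes more demanding. {directive}\n\n{instruction}"
        )
    return instruction

# Toy stand-in: a real pipeline would call the teacher model here.
fake_llm = lambda prompt: prompt.splitlines()[-1] + " (with added constraints)"
print(evolve("Write a calculator.", fake_llm))
```

Each pass feeds the previous instruction back in, so complexity compounds with depth, which is what produces the Evolution 1 and Evolution 2 steps shown above.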

2. Backtranslation

Take a sentence in one language (e.g., French), have a model translate it to English, then translate the English back. If the meaning is preserved but the phrasing is new, you have a fresh training pair.
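A toy round trip makes the idea concrete. The word-level dictionaries below stand in for real translation models (an assumption for illustration; production pipelines use an MT system and a semantic-similarity check rather than exact string equality):

```python
# Tiny word-level "translation models" for the round trip.
FR_TO_EN = {"le": "the", "chat": "cat", "dort": "sleeps"}
EN_TO_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}

def translate(sentence, table):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())

source = "le chat dort"
english = translate(source, FR_TO_EN)      # forward pass
round_trip = translate(english, EN_TO_FR)  # back-translation

# Keep the pair only if the meaning survives the round trip.
if round_trip == source:
    training_pair = (english, source)
    print(training_pair)
```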

The Future is Synthetic

Jensen Huang (NVIDIA CEO) has stated that synthetic data generation is one of the primary use cases for their H100 clusters. We are moving from “mining” data to “manufacturing” it.

Benefits

  • Privacy: Synthetic medical records can be shared without HIPAA violations.
  • Bias Control: You can explicitly program diversity into the generated dataset.
  • Infinite Scale: You are limited only by compute, not by the number of humans on Earth.

In 2025, the best models aren’t just reading the internet—they are reading textbooks written specifically for them, by their predecessors.