Training Data: Garbage In, Garbage Out

There is a famous saying in computer science: “Garbage In, Garbage Out.” Nowhere is this truer than in Generative AI.

You can have the most advanced Transformer architecture, thousands of H100 GPUs, and the best engineers. But if you train the model on low-quality data, it will produce low-quality output.

The Era of “Textbooks Are All You Need”

For years, the philosophy was “More Data = Better.” Common Crawl (a massive, ongoing scrape of the public web) was the standard source. It contains Wikipedia, but it also contains Reddit comments, spam sites, and conspiracy blogs.

In 2023, Microsoft Research released a paper called “Textbooks Are All You Need.” They trained a tiny model (phi-1, 1.3B parameters) on highly curated, textbook-quality data: filtered code plus synthetic textbooks and exercises.

  • Result: The tiny model outperformed far larger models on coding benchmarks (HumanEval), despite training on a fraction of the data.

What makes data “High Quality”?

  1. Factuality: Wikipedia is better than a random blog.
  2. Reasoning Steps: Data that explains why (Chain of Thought) is better than data that just gives the answer.
  3. Diversity: You need poetry, code, legal docs, and casual chat.
  4. Cleanliness: Removing HTML tags, “click here to subscribe,” and duplicate text.
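The cleanliness step above is the most mechanical, so here is a minimal sketch of it in Python. The `BOILERPLATE` list and the regexes are illustrative assumptions; a production pipeline would use a proper HTML parser, a large corpus-specific phrase list, and fuzzy (near-duplicate) rather than exact deduplication.

```python
import re

# Hypothetical boilerplate phrases; real pipelines use far larger lists.
BOILERPLATE = ["click here to subscribe", "accept all cookies"]

def clean_document(text: str) -> str:
    """Strip HTML tags and known boilerplate from one document."""
    text = re.sub(r"<[^>]+>", " ", text)  # crude tag removal
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "<p>Transformers use attention.</p> Click here to subscribe",
    "<p>Transformers use attention.</p>",
]
cleaned = deduplicate([clean_document(d) for d in docs])
print(cleaned)  # → ['Transformers use attention.']
```

Note that cleaning and deduplication interact: the two raw documents above differ as strings, and only become duplicates after the boilerplate is stripped, which is why cleaning runs first.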

Synthetic Data: The Ouroboros

We are running out of fresh human text; the highest-quality parts of the public web have already been scraped. So companies are now turning to Synthetic Data: using a strong model like GPT-4 to write high-quality “textbooks” that train its successor.

  • Risk: Model Collapse. If a model trains recursively on AI-generated output, rare patterns vanish and errors compound, a kind of “genetic drift” that can leave later generations incoherent.
  • Benefit: We can generate effectively unlimited data for domains like coding or math, where correctness can be verified automatically (run the tests, check the proof).
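The generate-then-verify loop behind that benefit can be sketched in a few lines. Here `generate_candidate` is a toy stand-in for a teacher model (it emits arithmetic problems and occasionally a wrong answer, mimicking model errors), and `verify` recomputes the ground truth; for code data, the analogue of `verify` is running unit tests.

```python
import random

def generate_candidate(rng: random.Random) -> dict:
    """Toy 'teacher model': emit an addition problem, ~25% with a wrong answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    answer = a + b + rng.choice([0, 0, 0, 1])  # sometimes off by one
    return {"question": f"What is {a} + {b}?",
            "claimed_answer": answer, "a": a, "b": b}

def verify(example: dict) -> bool:
    """Automatic checker: recompute the ground truth independently."""
    return example["claimed_answer"] == example["a"] + example["b"]

rng = random.Random(0)
candidates = [generate_candidate(rng) for _ in range(1000)]
verified = [c for c in candidates if verify(c)]
print(f"kept {len(verified)} of {len(candidates)} candidates")
```

The key property is that verification is cheap and exact, so flawed generations are simply filtered out before training: the model's error rate does not matter, only the checker's correctness.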

Conclusion

The future of AI isn’t just bigger chips. It’s better data curation. The role of “Data Engineer” is evolving into “Data Curator”—someone who selects the curriculum for the digital mind.