Tokenization: How AI Reads Text

When you type “Hello, World!” into ChatGPT, the AI doesn’t see those letters. It doesn’t see “H-e-l-l-o.” It sees a sequence of numbers, like [15496, 11, 4435, 0].

This process of converting text into numbers is called tokenization, and it is the first step every Large Language Model (LLM) performs on its input.

What is a Token?

A token is the basic unit of text for an AI.

  • It is NOT always a word.
  • It is NOT always a character.
  • It is a chunk of characters that frequently appear together.

Examples (using OpenAI’s tokenizer):

  • “apple” = 1 token
  • “friendship” = 1 token
  • “antidisestablishmentarianism” = 5 tokens (broken into subword chunks; the exact split depends on the tokenizer)
  • “123” = 1 token
  • “1234” = 2 tokens (e.g., “12” + “34”; the split depends on the model)

Rule of Thumb: 1,000 tokens is approximately 750 words in English.
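The “chunks of characters” idea can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is made up for illustration; real tokenizers learn their vocabulary (and the IDs behind each entry) from large corpora.

```python
def tokenize(word, vocab):
    """Greedily match the longest vocabulary entry from the left.

    Falls back to single characters when nothing in the vocabulary matches,
    mimicking how subword tokenizers never fail on unseen words.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Illustrative vocabulary (not any real model's).
vocab = {"friend", "ship", "friendship", "apple"}

print(tokenize("friendship", vocab))  # ['friendship']  -- one whole-word token
print(tokenize("apples", vocab))      # ['apple', 's']  -- rare word splits
print(tokenize("warship", vocab))     # ['w', 'a', 'r', 'ship']
```

This is why common words cost 1 token while rare words cost several: the longest match wins, and anything unfamiliar gets broken into smaller known pieces.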

Why Tokenization Matters

1. The “Strawberry” Problem

You might have seen viral videos of AI failing to count the number of ‘r’s in “strawberry.”

  • Human: Sees “s-t-r-a-w-b-e-r-r-y”.
  • AI: Often sees a single token, e.g. [strawberry]. To the model, that token is one indivisible ID. It has no inherent access to the letters inside that ID unless it effectively memorized the spelling during training.
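The contrast can be made concrete in a few lines. The token ID below is made up; the point is that once text becomes IDs, the characters are gone.

```python
# Human-style view: the word as characters. Counting 'r' is trivial.
word = "strawberry"
print(word.count("r"))  # 3

# Model-style view: one opaque token ID (illustrative, not a real ID).
token_ids = [101]  # "strawberry" -> a single number; the letters are gone
# Nothing in [101] encodes "contains three r's". To answer spelling
# questions, the model must have learned that fact during training.
```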

2. The Math Problem

LLMs are notoriously bad at arithmetic with large numbers. If you ask one to add 1234 + 5678, it might struggle. Why? Because 1234 might be tokenized as [12, 34] and 5678 as [56, 78]. The AI isn’t doing math on digits; it’s predicting the next token from statistical patterns over pairs of numbers.
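A toy chunker makes the mismatch visible. Fixed two-digit chunks are a simplification (real tokenizers use learned splits), but the effect is the same: the model manipulates chunks, not digits.

```python
def chunk_number(s, size=2):
    """Illustrative: split a number string into fixed-size chunks,
    mimicking a tokenizer that breaks '1234' into '12' + '34'."""
    return [s[i:i + size] for i in range(0, len(s), size)]

print(chunk_number("1234"))  # ['12', '34']
print(chunk_number("5678"))  # ['56', '78']
# The model sees ('12', '34') + ('56', '78') and must produce the token(s)
# for 6912 by pattern-matching, not by carrying digits like a calculator.
```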

3. Multilingual Efficiency

Tokenizers are usually trained on English-heavy data, so they are most efficient for English.

  • English: “Hello” is 1 token.
  • Other scripts: A single Kanji character or a complex Hindi word might be multiple tokens (bytes), making it more “expensive” to process non-English languages in terms of context window limits.
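Exact token counts depend on the tokenizer, but the raw imbalance is easy to see at the byte level: byte-level tokenizers start from UTF-8 bytes, and non-Latin characters take more bytes per character before any merging happens.

```python
# UTF-8 byte counts: the base units a byte-level tokenizer starts from.
print(len("Hello".encode("utf-8")))   # 5 bytes for 5 Latin letters
print(len("語".encode("utf-8")))      # 3 bytes for one Kanji character
print(len("नमस्ते".encode("utf-8")))    # 18 bytes for one short Hindi word
```

More base units per character means fewer merged tokens are learned for those scripts, so the same sentence consumes more of the context window in Hindi or Japanese than in English.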

Types of Tokenizers

  1. Word-level: Splits by spaces. (The vocabulary becomes huge, and unseen words can’t be represented.)
  2. Character-level: Splits every letter. (Sequences become far too long.)
  3. Subword (BPE, Byte Pair Encoding): The Goldilocks zone. It keeps common words whole (“the”, “apple”) but breaks rare words into chunks (“un-friend-li-ness”).
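The core of BPE training is simple: repeatedly find the most frequent adjacent pair of symbols and merge it into a new vocabulary entry. Here is a minimal sketch of one merge step on a made-up three-word corpus; real training runs thousands of merges over billions of words.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as character symbols) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 3, ("n", "e", "w"): 2}
pair = most_frequent_pair(corpus)   # ('l', 'o') appears 8 times
corpus = merge_pair(corpus, pair)   # 'lo' is now a single symbol
print(corpus)  # {('lo', 'w'): 5, ('lo', 't'): 3, ('n', 'e', 'w'): 2}
```

Run enough merge steps and frequent words like “the” collapse into single tokens, while rare words remain as several chunks: exactly the Goldilocks behavior described above.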

Conclusion

Tokenization is a necessary compression hack. It allows models to process text efficiently, but it introduces weird blind spots. The next time an AI makes a silly spelling mistake or math error, don’t blame its intelligence—blame its tokenizer.