Tokenization: How AI Reads Text
Why ChatGPT can't spell 'strawberry' and why math is hard for LLMs. It all starts with how they see text.
When you type “Hello, World!” into ChatGPT, the AI doesn’t see those letters. It doesn’t see “H-e-l-l-o.” It sees a sequence of numbers, like [15496, 11, 4435, 0].
This process of converting text into numbers is called Tokenization, and it is the foundational first step of all Large Language Models (LLMs).
What is a Token?
A token is the basic unit of text for an AI.
- It is NOT always a word.
- It is NOT always a character.
- It is a chunk of characters that frequently appear together.
Examples (using OpenAI’s tokenizer):
- “apple” = 1 token
- “friendship” = 1 token
- “antidisestablishmentarianism” = 5 tokens (roughly: ant, id, is, est, ablishmentarianism)
- “123” = 1 token
- “1234” = 2 tokens (12, 34, depending on the model)
Rule of Thumb: 1,000 tokens is approximately 750 words in English.
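To make the splitting behavior concrete, here is a toy greedy longest-match tokenizer. This is a deliberate simplification (real tokenizers learn their vocabularies from data, and the hand-picked vocabulary below exists only for this illustration), but it shows why common words stay whole while rare words fragment:

```python
# Toy greedy longest-match tokenizer. The vocabulary is hand-picked
# for illustration only; real vocabularies are learned from data.
VOCAB = {"apple", "friendship", "ant", "id", "is", "est",
         "ablishmentarianism", "the"}

def tokenize(text):
    """Split text into the longest known chunks, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to 1 char
            i += 1
    return tokens

print(tokenize("apple"))
# ['apple']  -> 1 token
print(tokenize("antidisestablishmentarianism"))
# ['ant', 'id', 'is', 'est', 'ablishmentarianism']  -> 5 tokens
```

The common word survives as one unit; the rare word is stitched together from frequent fragments.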
Why Tokenization Matters
1. The “Strawberry” Problem
You might have seen viral videos of AI failing to count the number of ‘r’s in “strawberry.”
- Human: Sees “s-t-r-a-w-b-e-r-r-y”.
- AI: Sees the token [strawberry]. To the AI, “strawberry” is a single indivisible ID. It doesn’t inherently know how many letters are inside that ID, unless it has memorized the spelling specifically.
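Ordinary code sees strings as sequences of characters, so the count is trivial; the model never gets that view. (The token ID in the comment is hypothetical, shown only to make the contrast vivid.)

```python
# Humans (and ordinary code) see characters, so counting is trivial.
word = "strawberry"
print(word.count("r"))  # 3

# The model instead sees one opaque token ID, e.g. [101830]
# (a made-up ID for illustration; the real value depends on the
# tokenizer). There are no letters "inside" that number to count.
```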
2. The Math Problem
LLMs are notoriously bad at arithmetic with large numbers.
If you ask it to add 1234 + 5678, it might struggle.
Why? Because 1234 might be tokenized as [12, 34] and 5678 as [56, 78].
The AI isn’t doing math on digits; it’s trying to predict the next token based on statistical patterns of pairs of numbers.
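A small sketch makes the problem visible: once a number is chopped into digit-pair tokens, the pieces lose their place value, and correct addition requires carrying across chunk boundaries, something a next-token predictor has to learn statistically rather than compute:

```python
def chunk_pairs(n):
    """Split a digit string into 2-digit chunks, as some tokenizers do."""
    return [n[i:i + 2] for i in range(0, len(n), 2)]

a = chunk_pairs("1234")
b = chunk_pairs("5678")
print(a, b)  # ['12', '34'] ['56', '78']

# Correct addition needs a carry across the chunk boundary:
# 34 + 78 = 112 -> write "12", carry 1 into 12 + 56.
# Real arithmetic, of course, works on the whole numbers:
print(int("1234") + int("5678"))  # 6912
```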
3. Multilingual Efficiency
Tokenizers are usually optimized for English.
- English: “Hello” is 1 token.
- Other scripts: A single Kanji character or a complex Hindi word might be multiple tokens (bytes), making it more “expensive” to process non-English languages in terms of context window limits.
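Exact token counts vary by model, but UTF-8 byte lengths already hint at the asymmetry, since byte-level tokenizers start from these bytes before any merging. (The Japanese and Hindi strings below both mean “hello.”)

```python
# UTF-8 byte lengths: one reason non-Latin scripts tend to cost more.
for text in ["Hello", "こんにちは", "नमस्ते"]:
    print(text, len(text), "chars,", len(text.encode("utf-8")), "bytes")

# "Hello"      -> 5 chars,  5 bytes
# "こんにちは"  -> 5 chars, 15 bytes (3 bytes per character)
```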
Types of Tokenizers
- Word-level: Splits on spaces. (The vocabulary becomes enormous, and unseen words can’t be represented.)
- Character-level: Splits every letter. (Sequences become very long, which makes processing slow.)
- Subword (BPE - Byte Pair Encoding): The Goldilocks zone. It keeps common words whole (“the”, “apple”) but breaks down rare words into chunks (“un-friend-li-ness”).
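The core of BPE training fits in a few lines: start from characters and repeatedly merge the most frequent adjacent pair into a new token. This is a minimal sketch of the training loop (real implementations train on huge corpora and store the learned merge rules for later encoding):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from characters and repeatedly merge the most frequent pair.
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)
# ['low', ' low', 'e', 'r', ' low', 'e', 's', 't']
```

After three merges, the common stem “low” has become a single token (including a leading-space variant, which is exactly what GPT-style tokenizers do), while the rarer suffixes remain as fragments.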
Conclusion
Tokenization is a necessary compression hack. It allows models to process text efficiently, but it introduces weird blind spots. The next time an AI makes a silly spelling mistake or math error, don’t blame its intelligence—blame its tokenizer.