Context Windows: Why Token Limits Matter

Every LLM has a hard limit called the Context Window. This is the maximum amount of text (measured in tokens) the model can consider at one time.

  • GPT-3: 2,048 tokens (~1,500 words).
  • GPT-4o: 128,000 tokens (~96,000 words).
  • Gemini 1.5 Pro: 2,000,000 tokens (~1.5 million words).
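For back-of-the-envelope planning, English text averages roughly 4 characters (or ~0.75 words) per token. A minimal sketch of that heuristic; exact counts depend on each model's tokenizer:

```python
# Rough rule of thumb for English: ~4 characters (or ~0.75 words) per token.
# Real counts vary by tokenizer; this is only an estimate for planning.
def estimate_tokens(text: str) -> int:
    return max(len(text) // 4, 1)

# Converting the context limits above into approximate word budgets:
print(int(2_048 * 0.75))    # ~1,500 words for GPT-3
print(int(128_000 * 0.75))  # ~96,000 words for GPT-4o
```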

Why is there a limit?

The limit isn’t arbitrary. It’s mathematical. Recall the Attention Mechanism: every token attends to every other token, so with $N$ tokens the attention computation grows quadratically, as $N^2$.

  • 1,000 tokens -> 1,000,000 calculations.
  • 100,000 tokens -> 10,000,000,000 calculations.

Doubling the context length makes the computation 4x slower and uses 4x more memory. This is the “Quadratic Bottleneck.”
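The quadratic growth above is easy to verify directly:

```python
# Attention compares every token with every other token,
# so the number of pairwise comparisons scales as N^2.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens ** 2

print(attention_pairs(1_000))    # 1,000,000 calculations
print(attention_pairs(100_000))  # 10,000,000,000 calculations

# Doubling the context length quadruples the work:
assert attention_pairs(2_000) == 4 * attention_pairs(1_000)
```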

The “Rolling Window” Effect

What happens when you exceed the limit? The model is forced to “forget.” Most chat interfaces use a rolling window:

  1. You talk for an hour.
  2. The conversation exceeds the limit.
  3. The system silently deletes the oldest messages from the prompt sent to the AI.
  4. Suddenly, the AI forgets your name or the instructions you gave at the start.
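The steps above can be sketched in a few lines. This is an illustrative toy, not any vendor's actual API; the per-message token counts are hypothetical:

```python
# Minimal sketch of a rolling context window.
def apply_rolling_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the conversation fits the token budget."""
    kept = list(messages)
    while kept and sum(m["tokens"] for m in kept) > max_tokens:
        kept.pop(0)  # silently delete the oldest message
    return kept

history = [
    {"role": "user", "text": "My name is Ada.", "tokens": 6},
    {"role": "assistant", "text": "Nice to meet you, Ada!", "tokens": 7},
    {"role": "user", "text": "Summarize this report...", "tokens": 50},
]
trimmed = apply_rolling_window(history, max_tokens=55)
# The two oldest messages (including the one with your name) were dropped,
# which is why the model "forgets" early instructions.
```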

The Needle in a Haystack

Just because a model accepts 100k tokens doesn’t mean it pays attention to all of them effectively. Researchers use the “Needle in a Haystack” test:

  1. Take a long document (The Haystack).
  2. Insert a random fact in the middle (The Needle), e.g. “The secret code is 9942.”
  3. Ask the model: “What is the secret code?”

Early long-context models failed this test. They paid attention to the beginning and end of the prompt but got “lost in the middle.” Modern models (Gemini 1.5, Claude 3) report near-perfect recall on it.
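Building such a probe is straightforward. A minimal sketch, where `ask_model` is a hypothetical stand-in for a call to whichever LLM is being evaluated:

```python
import random

# Sketch of a "Needle in a Haystack" probe: bury one fact in filler text,
# then check whether the model can retrieve it.
def build_haystack(needle: str, filler_sentences: int = 10_000) -> str:
    filler = ["The sky was a pleasant shade of blue that day."] * filler_sentences
    position = random.randrange(len(filler))  # bury the needle at a random depth
    filler.insert(position, needle)
    return " ".join(filler)

haystack = build_haystack("The secret code is 9942.")
# answer = ask_model(haystack + "\n\nWhat is the secret code?")  # hypothetical call
# recall = 1.0 if "9942" in answer else 0.0
```

In practice the needle is inserted at many different depths and the recall scores are plotted as a heatmap over depth and context length.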

Solutions for the Future

  1. RAG (Retrieval-Augmented Generation): Don’t put everything in context. Store data in a database, search for the 5 most relevant snippets, and feed only those into the context window.
  2. Linear Attention: New architectures (such as Mamba or RWKV) replace quadratic attention with mechanisms whose cost grows as $N$ instead of $N^2$, in principle allowing unbounded context.
  3. Ring Attention: Splitting the context across multiple GPUs to handle millions of tokens.

Conclusion

The context window is the AI’s “working memory.” As it grows, we move from models that can read an article (2k tokens) to models that can read a book (100k tokens) to models that can read a corporation’s entire video archive (10M+ tokens).