The Transformer Architecture: How Attention Changed Everything
A clear explanation of the transformer model — the architecture behind GPT, BERT, and virtually every modern LLM.
In 2017, a paper titled “Attention Is All You Need” introduced the transformer. It replaced recurrent networks (RNNs, LSTMs) as the dominant architecture for sequence modeling. Today, transformers power GPT-4, Claude, Gemini, BERT, Whisper, Stable Diffusion, and essentially every major AI breakthrough of the last seven years.
Understanding transformers is understanding modern AI.
The Problem With RNNs
Before transformers, sequence models processed tokens one at a time, left to right. To understand a word, the model had to “remember” everything that came before it — encoded in a fixed-size hidden state.
Problems:
- Long-range dependencies — by the time you reach token 500, token 1 has largely been forgotten
- Sequential computation — can’t parallelize; token N depends on token N-1
- Vanishing gradients — gradients decay over long sequences during training
Transformers solve all three with a single idea: attention.
Self-Attention: The Core Mechanism
Self-attention lets every token in a sequence attend to every other token simultaneously. No more sequential bottleneck.
For each token, the model computes three vectors:
- Query (Q) — what this token is looking for
- Key (K) — what this token offers to others
- Value (V) — the actual content to aggregate
The attention score between two tokens is the dot product of their Q and K vectors, scaled and softmaxed:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The scaling by √d_k prevents the dot products from growing too large in high dimensions, which would push softmax into near-zero-gradient territory.
The result: each token gets a weighted sum of all Value vectors, where the weights reflect how relevant each other token is.
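In code, scaled dot-product attention is only a few lines. A minimal NumPy sketch (the function names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k); the weight matrix is (seq_len, seq_len)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Each row of the output is a weighted sum of the Value vectors
    return weights @ V
```

Each row of `weights` sums to 1, so every token's output is a convex combination of all Value vectors.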
Multi-Head Attention
One attention head lets the model learn one kind of relationship. Multi-head attention runs H independent attention heads in parallel, concatenates the results, and projects back to the model dimension.
Each head can specialize:
- Head 1: syntactic dependencies
- Head 2: coreference resolution
- Head 3: semantic similarity
- … etc.
```python
# Simplified multi-head attention (NumPy)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    heads = []
    for i in range(num_heads):
        Q = X @ W_Q[i]   # (seq_len, d_k)
        K = X @ W_K[i]
        V = X @ W_V[i]
        d_k = Q.shape[-1]
        scores = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(scores @ V)
    # Concatenate head outputs along the feature axis, then project back
    return np.concatenate(heads, axis=-1) @ W_O
```
The Full Transformer Block
Each transformer layer contains:
- Multi-head self-attention — tokens attend to each other
- Add & Norm — residual connection + layer normalization
- Feed-forward network (FFN) — two linear layers with a nonlinearity (ReLU or GELU)
- Add & Norm — another residual + norm
The FFN is applied independently to each token position. It's where most of each layer's parameters live: the hidden size is typically 4 × d_model, which gives the FFN roughly twice as many parameters as the attention sublayer.
```
Input → MHA → Add+Norm → FFN → Add+Norm → Output
  ↑___residual___↑         ↑___residual___↑
```
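Putting the pieces together, one post-norm block (the layout used in the original paper) can be sketched in NumPy; `self_attn` is passed in as a function, and the parameter names here are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU nonlinearity, applied per position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, self_attn, ffn_params):
    # Post-norm: each sublayer's output is added to its input (residual),
    # then layer-normalized
    x = layer_norm(x + self_attn(x))
    x = layer_norm(x + ffn(x, *ffn_params))
    return x
```

Note that modern LLMs mostly use a pre-norm variant (normalize before each sublayer rather than after), which trains more stably at depth.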
Positional Encoding
Self-attention is inherently order-agnostic — it treats the sequence as a set, not a sequence. Without position information, “the cat sat on the mat” and “the mat sat on the cat” look identical.
The fix: add positional encodings to token embeddings before the first layer.
Sinusoidal encodings (original paper): fixed patterns based on sine and cosine at different frequencies.
Learned positional embeddings (BERT, GPT): a trainable embedding table, one vector per position.
Rotary Position Embedding (RoPE) (modern LLMs): encodes position into Q and K directly, enabling better extrapolation to longer sequences.
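The sinusoidal variant is simple to compute directly. A sketch, assuming an even `d_model`:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
```

Each dimension oscillates at a different frequency, so every position gets a unique fingerprint that nearby positions share most of.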
Encoder vs Decoder vs Both
The original transformer had both encoder and decoder halves:
| Type | Examples | Use case |
|---|---|---|
| Encoder only | BERT, RoBERTa | Classification, embeddings, understanding |
| Decoder only | GPT, Llama, Claude | Text generation, completion |
| Encoder-decoder | T5, BART | Translation, summarization |
Decoder-only models use causal (masked) attention — each token can only attend to previous tokens, not future ones. This ensures autoregressive generation works correctly.
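The causal mask is typically implemented by adding negative infinity to the attention scores at future positions before the softmax, so those positions receive exactly zero weight. A NumPy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(Q, K):
    # Add -inf strictly above the diagonal so softmax assigns
    # zero weight to every future position
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    return softmax(scores + mask)
```

After the softmax, the weight matrix is lower-triangular: token i's output depends only on tokens 0 through i.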
Scaling
Transformers scale remarkably well. The scaling laws (Kaplan et al., 2020; Chinchilla, 2022) show that model performance improves predictably as you increase:
- Number of parameters
- Training data
- Compute budget
This predictability is part of why the AI industry has been able to plan model training runs years in advance.
Key hyperparameters:
- d_model: embedding dimension (e.g., 4096 for Llama 3 8B)
- n_layers: number of transformer blocks (e.g., 32)
- n_heads: number of attention heads (e.g., 32)
- d_ff: feed-forward hidden size (e.g., 14336)
- context_length: max tokens the model can attend to
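These numbers make a rough parameter count possible. The sketch below is a back-of-envelope estimate, not an exact count: it ignores biases, norms, and grouped-query attention (which shrinks the K/V projections in Llama 3), and assumes a gated SwiGLU-style FFN with three weight matrices and a vocabulary of roughly 128K tokens:

```python
def approx_params(d_model, n_layers, d_ff, vocab_size, gated_ffn=True):
    # Rough count: ignores biases, norms, and grouped-query attention
    attn = 4 * d_model * d_model                     # W_Q, W_K, W_V, W_O
    ffn = (3 if gated_ffn else 2) * d_model * d_ff   # SwiGLU uses 3 matrices
    embed = vocab_size * d_model                     # token embedding table
    return n_layers * (attn + ffn) + embed
```

Plugging in the Llama-3-8B-like numbers above lands in the ballpark of 8 billion parameters, which is a useful sanity check on the hyperparameters.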
Conclusion
The transformer’s genius is its simplicity: replace sequential computation with parallel attention, let the model learn what to attend to, and stack many layers. The rest — scale, data, RLHF — built on top of this foundation.
Everything in modern AI starts here.