The Transformer Architecture: How Attention Changed Everything

In 2017, a paper titled “Attention Is All You Need” introduced the transformer. It replaced recurrent networks (RNNs, LSTMs) as the dominant architecture for sequence modeling. Today, transformers power GPT-4, Claude, Gemini, BERT, Whisper, Stable Diffusion, and essentially every major AI breakthrough of the last seven years.

Understanding transformers is understanding modern AI.

The Problem With RNNs

Before transformers, sequence models processed tokens one at a time, left to right. To understand a word, the model had to “remember” everything that came before it — encoded in a fixed-size hidden state.

Problems:

  • Long-range dependencies — by the time you reach token 500, token 1 has largely been forgotten
  • Sequential computation — can’t parallelize; token N depends on token N-1
  • Vanishing gradients — gradients decay over long sequences during training

Transformers solve all three with a single idea: attention.

Self-Attention: The Core Mechanism

Self-attention lets every token in a sequence attend to every other token simultaneously. No more sequential bottleneck.

For each token, the model computes three vectors:

  • Query (Q) — what this token is looking for
  • Key (K) — what this token offers to others
  • Value (V) — the actual content to aggregate

The attention score between two tokens is the dot product of their Q and K vectors, scaled and softmaxed:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling by √d_k prevents the dot products from growing too large in high dimensions, which would push softmax into near-zero-gradient territory.
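
This effect is easy to verify empirically: for random vectors with unit-variance entries, the dot product q·k has standard deviation around √d_k, so unscaled scores grow with dimension while scaled scores stay near 1. A self-contained NumPy demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # Many random query/key pairs with unit-variance entries.
    q = rng.normal(size=(5_000, d_k))
    k = rng.normal(size=(5_000, d_k))
    dots = (q * k).sum(axis=1)          # raw dot products
    scaled_std = (dots / np.sqrt(d_k)).std()
    # Raw std grows like sqrt(d_k); scaled std stays near 1.
    print(f"d_k={d_k:5d}  raw std={dots.std():6.1f}  scaled std={scaled_std:.2f}")
```

Without the scaling, a softmax over scores with standard deviation 32 (d_k = 1024) would put nearly all its mass on one token and produce vanishingly small gradients elsewhere.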

The result: each token gets a weighted sum of all Value vectors, where the weights reflect how relevant each other token is.
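
Putting the formula into code, a minimal single-head version looks like this (a NumPy sketch; in a real model Q, K, and V come from learned projections of the token embeddings, here they are just random matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # (seq_len, seq_len) matrix of pairwise relevance scores.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)           # each row sums to 1
    # Each output row is a weighted average of the Value vectors.
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))             # 5 tokens, d_k = 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)
print(out.shape)                        # (5, 8): one vector per token
```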

Multi-Head Attention

One attention head lets the model learn one kind of relationship. Multi-head attention runs H independent attention heads in parallel, concatenates the results, and projects back to the model dimension.

Each head can specialize:

  • Head 1: syntactic dependencies
  • Head 2: coreference resolution
  • Head 3: semantic similarity
  • … etc.

# Simplified multi-head attention (NumPy; illustrative, not optimized)
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    heads = []
    for i in range(num_heads):
        Q = X @ W_Q[i]                  # (seq_len, d_k)
        K = X @ W_K[i]
        V = X @ W_V[i]
        d_k = Q.shape[-1]
        scores = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(scores @ V)
    # Concatenate the head outputs and project back to the model dimension.
    return np.concatenate(heads, axis=-1) @ W_O

The Full Transformer Block

Each transformer layer contains:

  1. Multi-head self-attention — tokens attend to each other
  2. Add & Norm — residual connection + layer normalization
  3. Feed-forward network (FFN) — two linear layers with a nonlinearity (ReLU or GELU)
  4. Add & Norm — another residual + norm

The FFN is applied independently to each token position. It’s where most of each layer’s parameters live: with the conventional choice d_ff = 4·d_model, the FFN holds roughly twice as many parameters as the attention sublayer (about 8·d_model² versus 4·d_model² per layer).

Input → MHA → Add+Norm → FFN → Add+Norm → Output
         ↑___residual___↑  ↑___residual___↑
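
The four steps above can be sketched in NumPy (post-norm ordering, as listed; `attn_fn` is a stand-in for the multi-head attention sublayer, and the FFN uses ReLU for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: two linear layers with a ReLU between.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def transformer_block(x, attn_fn, W1, b1, W2, b2):
    x = layer_norm(x + attn_fn(x))              # steps 1-2: MHA + Add & Norm
    x = layer_norm(x + ffn(x, W1, b1, W2, b2))  # steps 3-4: FFN + Add & Norm
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 4
x = rng.normal(size=(seq_len, d_model))
W1, b1 = 0.1 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
# Identity stands in for multi-head attention to keep the sketch short.
out = transformer_block(x, lambda h: h, W1, b1, W2, b2)
print(out.shape)                                # (4, 16)
```

Many recent models move the normalization before each sublayer (pre-norm) for training stability, but the residual-plus-norm structure is the same.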

Positional Encoding

Self-attention is inherently order-agnostic — it treats the sequence as a set, not a sequence. Without position information, “the cat sat on the mat” and “the mat sat on the cat” look identical.

The fix: add positional encodings to token embeddings before the first layer.

Sinusoidal encodings (original paper): fixed patterns based on sine and cosine at different frequencies.

Learned positional embeddings (BERT, GPT): a trainable embedding table, one vector per position.

Rotary Position Embedding (RoPE) (modern LLMs): encodes position into Q and K directly, enabling better extrapolation to longer sequences.
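
As a concrete example, the sinusoidal scheme from the original paper fits in a few lines (a NumPy sketch; assumes d_model is even):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(128, 64)            # added to token embeddings
```

Low dimensions oscillate quickly and high dimensions slowly, so every position gets a unique fingerprint without any trainable parameters.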

Encoder vs Decoder vs Both

The original transformer had both encoder and decoder halves:

Type              Examples              Use case
Encoder only      BERT, RoBERTa         Classification, embeddings, understanding
Decoder only      GPT, Llama, Claude    Text generation, completion
Encoder-decoder   T5, BART              Translation, summarization

Decoder-only models use causal (masked) attention — each token can only attend to previous tokens, not future ones. This ensures autoregressive generation works correctly.
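
One common way to implement the mask is to add −∞ above the diagonal of the score matrix before the softmax, which zeroes out attention to future positions (a small NumPy sketch, not any particular library’s API):

```python
import numpy as np

def causal_masked_softmax(scores):
    # -inf above the diagonal becomes exactly 0 after the softmax,
    # so position i can only attend to positions j <= i.
    seq_len = scores.shape[-1]
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    s = scores + mask
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_masked_softmax(np.zeros((4, 4)))
# With uniform scores, row i spreads its weight evenly over positions 0..i:
# row 0 -> [1, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0], and so on.
```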

Scaling

Transformers scale remarkably well. The scaling laws (Kaplan et al., 2020; Chinchilla, 2022) show that model performance improves predictably as you increase:

  • Number of parameters
  • Training data
  • Compute budget

This predictability is part of why the AI industry has been able to plan model training runs years in advance.

Key hyperparameters:

  • d_model: embedding dimension (e.g., 4096 for Llama 3 8B)
  • n_layers: number of transformer blocks (e.g., 32)
  • n_heads: number of attention heads (e.g., 32)
  • d_ff: feed-forward hidden size (e.g., 14336)
  • context_length: max tokens the model can attend to
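
These hyperparameters largely determine the parameter count. The back-of-envelope sketch below makes several assumptions not stated above (a gated, SwiGLU-style FFN with three weight matrices, a single untied embedding table, a vocabulary size of roughly 128k, and no grouped-query attention), so it lands near, not exactly on, the published 8B figure:

```python
def approx_params(d_model, n_layers, d_ff, vocab_size):
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 3 * d_model * d_ff       # gate, up, and down matrices (SwiGLU-style)
    return n_layers * (attn + ffn) + vocab_size * d_model

# Hypothetical Llama-3-8B-like settings; the vocabulary size is an assumption.
total = approx_params(d_model=4096, n_layers=32, d_ff=14336, vocab_size=128_256)
print(f"~{total / 1e9:.1f}B parameters")
```

This estimate comes out around 8.3B; the real model is somewhat smaller because grouped-query attention shrinks the K and V projections.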

Conclusion

The transformer’s genius is its simplicity: replace sequential computation with parallel attention, let the model learn what to attend to, and stack many layers. The rest — scale, data, RLHF — built on top of this foundation.

Everything in modern AI starts here.