What It Is
The Transformer is an attention-based neural network architecture for sequence modeling that replaces recurrence and convolutions with stacked self-attention and feed-forward layers. Introduced by Vaswani et al. (2017), it solved the parallelization problem that made RNNs slow to train — enabling scaling to billions of parameters and becoming the dominant architecture for NLP, vision, audio, and multimodal systems.
Why It Matters
Virtually every frontier language model (GPT, BERT, LLaMA, PaLM, Claude, Gemini) is built on the Transformer. Its full parallelism during training is the reason scaling works: you can use thousands of GPUs simultaneously, whereas RNNs require sequential computation. Understanding the Transformer is foundational to understanding modern ML — LoRA adapts its weight matrices, FlashAttention optimizes its attention kernel, and KV-caching exploits its key/value structure at inference time.
How It Works
Three Variants
The original Transformer paper introduced an encoder-decoder architecture for translation. Subsequent models have specialized:
- Encoder-only (BERT-style): Full bidirectional attention over all tokens. Best for classification, embeddings, retrieval. Each token attends to every other token in both directions.
- Decoder-only (GPT-style): Causal (left-to-right) masking prevents attending to future tokens. Used for autoregressive generation — every frontier LLM is decoder-only.
- Encoder-decoder (T5/BART-style): Encoder processes the input with full attention; decoder generates output with causal attention plus cross-attention to encoder states. Used for summarization, translation, structured generation.
Layer Structure
Each Transformer layer contains two sublayers, each wrapped in a residual connection + LayerNorm:
- Multi-head self-attention — attends across all positions in the sequence simultaneously.
- Position-wise FFN — two linear layers applied identically at each position: expand to 4× d_model, apply the nonlinearity (ReLU in the original paper; GeLU or SwiGLU in modern models), contract back.
Input tokens (e.g. "The cat sat")
|
[Token Embedding: d_model=512 per token]
+
[Positional Encoding: sinusoid or learned]
|
v
+---------------------------------------------+
| ENCODER LAYER x6 (or DECODER LAYER x6) |
| |
| 1) Multi-Head Self-Attention |
| Each token creates Q, K, V (dim 64) |
| Q × Kᵀ → scores → softmax → × V |
| 8 heads run in parallel, concatenated |
| + residual, then LayerNorm |
| |
| [DECODER ONLY: Cross-Attention here] |
| Q from decoder, K/V from encoder output |
| |
| 2) Feed-Forward Network |
|    Linear(512→2048) → ReLU → Linear(2048→512) |
| Applied per-position independently |
| + residual, then LayerNorm |
+---------------------------------------------+
|
Final representations
|
[Linear + Softmax → next token probabilities]
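The feed-forward sublayer in step 2 is small enough to sketch directly. A minimal NumPy illustration with random weights (ReLU as in the original paper; not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3

x = rng.standard_normal((seq_len, d_model))       # one token per row
W1 = rng.standard_normal((d_model, d_ff)) * 0.02  # expand 512 -> 2048
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02  # contract 2048 -> 512
b2 = np.zeros(d_model)

def ffn(x):
    # applied identically and independently at every position (row)
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU
    return h @ W2 + b2

out = ffn(x)
print(out.shape)  # (3, 512): same shape in, same shape out
```

Because the same weights act on each row independently, the FFN mixes information across the embedding dimension but never across positions — that mixing is attention's job.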
Multi-Head Attention Mechanism
For each head i, the input is projected to queries, keys, and values of dimension d_k = d_model / h (512 / 8 = 64 in the original), and attention is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

Symbol translations:
- Q (queries): “what am I looking for?”
- K (keys): “what do I offer for comparison?”
- V (values): “what information do I pass along if selected?”
- QKᵀ: dot product of every query against every key; O(n²) in sequence length
- √d_k: scaling factor; prevents dot products from growing so large that they saturate the softmax into near-zero gradients
- softmax(...): converts raw scores to attention weights summing to 1 per query
All heads concatenated: MultiHead(Q,K,V) = Concat(head₁, …, head₈) × W_O
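The per-head projections, the scaled dot product, and the final concatenation can be sketched end-to-end in NumPy (random weights, no masking; all names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 512, 8, 3
d_k = d_model // n_heads  # 64

x = rng.standard_normal((seq_len, d_model))
# one projection per head for Q, K, V, plus the shared output projection W_O
W_Q = rng.standard_normal((n_heads, d_model, d_k)) * 0.02
W_K = rng.standard_normal((n_heads, d_model, d_k)) * 0.02
W_V = rng.standard_normal((n_heads, d_model, d_k)) * 0.02
W_O = rng.standard_normal((n_heads * d_k, d_model)) * 0.02

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(n_heads):
    Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]  # (seq, d_k) each
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq)
    heads.append(softmax(scores) @ V)             # (seq, d_k)

out = np.concatenate(heads, axis=-1) @ W_O        # (seq, d_model)
print(out.shape)  # (3, 512)
```

Production implementations compute all heads in one batched matmul rather than a Python loop, but the math is identical.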
In practice the eight heads tend to specialize: one may track syntax, another coreference, another positional proximity. A single head would have to average over all these roles, losing discriminative power.
Numeric Walkthrough
Trace attention for “cat” in [“cat”, “sat”, “on”] with d_k = 4, using the embeddings directly as queries, keys, and values for simplicity (a real model first applies the learned W_Q, W_K, W_V projections):
Embeddings (simplified):
cat = [0.9, 0.2, 0.1, 0.8]
sat = [0.3, 0.7, 0.6, 0.1]
on = [0.5, 0.4, 0.8, 0.3]
QKᵀ scores for "cat" query:
cat·cat = 0.81 + 0.04 + 0.01 + 0.64 = 1.50
cat·sat = 0.27 + 0.14 + 0.06 + 0.08 = 0.55
cat·on = 0.45 + 0.08 + 0.08 + 0.24 = 0.85
Scale by 1/√4 = 0.5:
[0.75, 0.275, 0.425]
Softmax → attention weights:
[0.427, 0.265, 0.308]
New "cat" = 0.427×cat_values + 0.265×sat_values + 0.308×on_values
= context-enriched "cat" vector (no longer just cat)
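The trace above can be checked mechanically. A short NumPy sketch reproducing the numbers (embeddings used directly as queries, keys, and values, as in the trace):

```python
import numpy as np

emb = np.array([
    [0.9, 0.2, 0.1, 0.8],   # cat
    [0.3, 0.7, 0.6, 0.1],   # sat
    [0.5, 0.4, 0.8, 0.3],   # on
])

scores = emb[0] @ emb.T          # [1.50, 0.55, 0.85] -- "cat" query vs all keys
scaled = scores / np.sqrt(4)     # [0.75, 0.275, 0.425]
weights = np.exp(scaled) / np.exp(scaled).sum()
print(np.round(weights, 3))      # [0.427 0.265 0.308]

new_cat = weights @ emb          # context-enriched "cat" vector
```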
Causal Masking (Decoder-Only)
In GPT-style models, the attention matrix is masked so position i can only attend to positions ≤ i. This is implemented by adding −∞ to the attention logits at future positions before softmax, making them zero-weight. This is what makes autoregressive generation possible: position 5 produces probabilities for token 6 without “seeing” tokens 6+.
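The masking step itself is a few lines of NumPy (−∞ added to future-position logits before the softmax; uniform zero logits stand in for real QKᵀ/√d_k scores):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # stand-in for raw attention logits

# upper-triangular mask: position i may not see positions > i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                 # exp(-inf) = 0 after softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# row i spreads its weight uniformly over positions 0..i;
# every future position gets exactly zero weight
```

Because the mask depends only on sequence length, it is computed once and reused for every layer and every batch.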
Residual Connections + LayerNorm
Every sublayer output is added back to its input before normalization. Without residuals, gradients vanish through 6+ layers. LayerNorm normalizes each token’s representation to zero mean and unit variance, computed per-token across the embedding dimension (not across the batch as in BatchNorm). This stabilizes training regardless of sequence length or batch size.
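A minimal sketch of per-token LayerNorm and the Post-LN residual wrapper described above (NumPy; the learnable gain and bias are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token (row) across the embedding dimension --
    # not across the batch, as BatchNorm would
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    # original "Post-LN" ordering: add the residual, then normalize
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 512))
out = post_ln_sublayer(x, lambda h: 0.1 * h)  # stand-in for attention/FFN
# each output row now has ~zero mean and ~unit variance
```

Swapping the two calls (`x + sublayer(layer_norm(x))`) gives the Pre-LN ordering used by most modern LLMs.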
Why It Replaced RNNs
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelism during training | None — sequential per step | Full — all positions simultaneously |
| Long-range dependencies | Degraded by O(n) path length | O(1) path — any two tokens attend directly |
| Max path length | n (information travels step-by-step) | 1 (direct attention) |
| GPU utilization | ~10% (sequential bottleneck) | ~60-80% (dense matrix ops) |
| Memory scaling | O(n) | O(n²) attention matrix |
The O(n²) memory cost is the Transformer’s original sin. It’s why FlashAttention, PagedAttention, and the entire efficient-attention subfield exist.
Scaling Behaviour
The Transformer exhibits predictable power-law scaling: loss improves as ~N^(-0.076) with parameters and ~D^(-0.095) with training tokens. This regularity is what made GPT-3 (175B), PaLM (540B), and LLaMA possible — you could predict the outcome before running the experiment. No comparable scaling laws exist for RNNs, partly because training instabilities made reaching very large scales impractical.
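Dropping the fitted constants and treating the two power laws as independent (a simplification; the real joint scaling law is not a plain product of the two terms), the relative payoff of scale is a one-liner:

```python
# Relative loss from scaling parameters (N) and training tokens (D),
# using the exponents quoted above; constants are dropped, so only
# ratios between two configurations are meaningful in this sketch.
def loss_ratio(scale_N=1.0, scale_D=1.0):
    return scale_N ** -0.076 * scale_D ** -0.095

# 10x more parameters at fixed data:
print(round(loss_ratio(scale_N=10.0), 3))  # 0.839 -> ~16% lower loss
```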
What’s Clever
The non-obvious move was the total elimination of recurrence. Attention had been used alongside RNNs since 2014 (Bahdanau et al.) — as a supplement to recurrence, not a replacement. The insight: recurrence was the bottleneck forcing sequential computation. Self-attention applied directly to token embeddings turned out to be sufficient. The paper’s title “Attention Is All You Need” was a provocation about this choice.
A common point of confusion: in the original paper, LayerNorm is applied after the residual addition (“Post-LN”). Modern practice switched to “Pre-LN” (normalize before the sublayer, inside the residual branch) because it trains more stably at large scale; this is what LLaMA and most modern LLMs use.
Key Sources
- attention-is-all-you-need — original paper; full architecture with numeric walkthrough
- bert-pre-training-of-deep-bidirectional-transformers — encoder-only variant with CLS token and bidirectional attention
- llama-open-efficient-foundation-language-models — modern decoder-only variant with RoPE and Pre-LN
Related Concepts
- attention — the scaled dot-product and multi-head attention mechanism
- positional-encoding — how position information is injected
- flash-attention — IO-aware attention kernel that reduces memory from O(n²) to O(n)
- kv-cache — caches K/V tensors at inference to avoid recomputation
- scaling-laws — power-law relationships that make Transformer training predictable
- ssm-mamba — linear-time alternative to attention for long sequences
Open Questions
- Can attention be replaced entirely for long sequences? (See Mamba/SSMs — competitive but not yet dominant)
- Optimal positional encoding: RoPE and ALiBi generalize better to longer contexts than sinusoidal
- Whether sparse or linear attention variants can match dense attention at scale
- Whether Pre-LN vs Post-LN matters beyond training stability (ongoing research)