What It Is

The Transformer is an attention-based neural network architecture for sequence modeling that replaces recurrence and convolutions with stacked self-attention and feed-forward layers. Introduced by Vaswani et al. (2017), it solved the parallelization problem that made RNNs slow to train — enabling scaling to billions of parameters and becoming the dominant architecture for NLP, vision, audio, and multimodal systems.

Why It Matters

Virtually every frontier language model (GPT, BERT, LLaMA, PaLM, Claude, Gemini) is built on the Transformer. Its full parallelism during training is the reason scaling works: you can use thousands of GPUs simultaneously, whereas RNNs require sequential computation. Understanding the Transformer is foundational to understanding modern ML — LoRA adapts its weight matrices, FlashAttention optimizes its attention kernel, and KV-caching exploits its key/value structure at inference time.

How It Works

Three Variants

The original Transformer paper introduced an encoder-decoder architecture for translation. Subsequent models have specialized:

  • Encoder-only (BERT-style): Full bidirectional attention over all tokens. Best for classification, embeddings, retrieval. Each token attends to every other token in both directions.
  • Decoder-only (GPT-style): Causal (left-to-right) masking prevents attending to future tokens. Used for autoregressive generation — every frontier LLM is decoder-only.
  • Encoder-decoder (T5/BART-style): Encoder processes the input with full attention; decoder generates output with causal attention plus cross-attention to encoder states. Used for summarization, translation, structured generation.
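
The three variants differ mainly in which positions each query may attend to. A minimal sketch of the mask patterns (the `attention_mask` helper is illustrative, not from any particular library):

```python
import numpy as np

def attention_mask(n_q, n_k, causal):
    """Boolean mask: True where a query position may attend to a key position."""
    if causal:
        # decoder-only / GPT-style: position i attends only to positions <= i
        return np.tril(np.ones((n_q, n_k), dtype=bool))
    # encoder-only / BERT-style: full bidirectional attention
    return np.ones((n_q, n_k), dtype=bool)

# Encoder-only:     attention_mask(n, n, causal=False) -> every token sees every token
# Decoder-only:     attention_mask(n, n, causal=True)  -> lower-triangular causal mask
# Encoder-decoder:  causal self-attention in the decoder, plus unmasked
#                   cross-attention from decoder queries to encoder keys
```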

Layer Structure

Each Transformer layer contains two sublayers, each wrapped in a residual connection + LayerNorm:

  1. Multi-head self-attention — attends across all positions in the sequence simultaneously.
  2. Position-wise FFN — two linear layers with a GeLU nonlinearity (ReLU in the original paper), applied identically at each position: expand to 4× d_model, contract back.

Input tokens (e.g. "The cat sat")
         |
[Token Embedding: d_model=512 per token]
         +
[Positional Encoding: sinusoid or learned]
         |
         v
+---------------------------------------------+
|  ENCODER LAYER x6 (or DECODER LAYER x6)     |
|                                             |
|  1) Multi-Head Self-Attention               |
|     Each token creates Q, K, V (dim 64)     |
|     Q × Kᵀ → scores → softmax → × V        |
|     8 heads run in parallel, concatenated   |
|     + residual, then LayerNorm              |
|                                             |
|  [DECODER ONLY: Cross-Attention here]       |
|     Q from decoder, K/V from encoder output |
|                                             |
|  2) Feed-Forward Network                    |
|     Linear(512→2048) → GeLU → Linear(2048→512) |
|     Applied per-position independently      |
|     + residual, then LayerNorm              |
+---------------------------------------------+
         |
    Final representations
         |
[Linear + Softmax → next token probabilities]
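
The layer in the diagram can be sketched in NumPy. This is a rough illustration only: single-head attention for brevity, Post-LN as in the original paper, ReLU instead of GeLU, and made-up small dimensions (d_model = 8 rather than 512):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # per-token normalization across the embedding dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One Post-LN encoder layer, single-head for brevity. x: (n, d_model)."""
    # 1) self-attention sublayer
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    x = layer_norm(x + attn @ Wo)            # residual add, then LayerNorm
    # 2) position-wise FFN: expand to 4x d_model, contract back
    h = np.maximum(0, x @ W1 + b1)           # ReLU here for brevity
    return layer_norm(x + h @ W2 + b2)       # residual add, then LayerNorm

rng = np.random.default_rng(0)
d, n = 8, 3
p = dict(Wq=rng.normal(size=(d, d)), Wk=rng.normal(size=(d, d)),
         Wv=rng.normal(size=(d, d)), Wo=rng.normal(size=(d, d)),
         W1=rng.normal(size=(d, 4 * d)), b1=np.zeros(4 * d),
         W2=rng.normal(size=(4 * d, d)), b2=np.zeros(d))
out = encoder_layer(rng.normal(size=(n, d)), **p)   # shape (3, 8)
```

A full model stacks several such layers (6 in the original paper) and feeds the final representations to the output projection.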

Multi-Head Attention Mechanism

For each head i, project Q, K, V to dimension d_k = d_model / h_heads and compute scaled dot-product attention:

  headᵢ = Attention(Qᵢ, Kᵢ, Vᵢ) = softmax(Qᵢ Kᵢᵀ / √d_k) Vᵢ

Symbol translations:

  • Q — query matrix: “what am I looking for?”
  • K — key matrix: “what do I offer for comparison?”
  • V — value matrix: “what information do I pass along if selected?”
  • QKᵀ — dot product of every query against every key; O(n²) in sequence length
  • √d_k — scaling factor; prevents dot products from growing too large and saturating the softmax into near-zero gradients
  • softmax(...) — converts raw scores to attention weights summing to 1 per query

All heads concatenated: MultiHead(Q,K,V) = Concat(head₁, …, head₈) × W_O

Eight heads learn to specialize: one head tracks syntax, another coreference, another positional proximity. A single head would have to average over all these, losing discriminative power.
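
The project–split–attend–concatenate flow above can be sketched as follows (illustrative shapes only; d_model = 16 here so the example stays small):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h=8):
    """x: (n, d_model); each W: (d_model, d_model); d_k = d_model // h."""
    n, d_model = x.shape
    d_k = d_model // h
    def heads(W):  # project, then split the last dim into h heads: (h, n, d_k)
        return (x @ W).reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = heads(Wq), heads(Wk), heads(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n), per head
    out = softmax(scores) @ V                            # (h, n, d_k)
    concat = out.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ Wo                                   # final projection W_O

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 16))
Ws = [rng.normal(size=(16, 16)) for _ in range(4)]
y = multi_head_attention(x, *Ws, h=8)   # shape (5, 16)
```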

Numeric Walkthrough

Trace attention for “cat” in [“cat”, “sat”, “on”] with d_k = 4:

Embeddings (simplified):
  cat = [0.9, 0.2, 0.1, 0.8]
  sat = [0.3, 0.7, 0.6, 0.1]
  on  = [0.5, 0.4, 0.8, 0.3]

QKᵀ scores for "cat" query:
  cat·cat = 0.81 + 0.04 + 0.01 + 0.64 = 1.50
  cat·sat = 0.27 + 0.14 + 0.06 + 0.08 = 0.55
  cat·on  = 0.45 + 0.08 + 0.08 + 0.24 = 0.85

Scale by 1/√4 = 0.5:
  [0.75, 0.275, 0.425]

Softmax → attention weights:
  [0.427, 0.265, 0.308]

New "cat" = 0.427×cat_values + 0.265×sat_values + 0.308×on_values
          = context-enriched "cat" vector (no longer just cat)
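
The walkthrough above can be reproduced directly in NumPy (using the raw embeddings as both queries and keys, as the text does):

```python
import numpy as np

emb = np.array([[0.9, 0.2, 0.1, 0.8],   # cat
                [0.3, 0.7, 0.6, 0.1],   # sat
                [0.5, 0.4, 0.8, 0.3]])  # on

scores = emb[0] @ emb.T        # "cat" query vs. all keys -> [1.50, 0.55, 0.85]
scaled = scores / np.sqrt(4)   # scale by 1/sqrt(d_k) = 0.5
weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax -> ~[0.427, 0.265, 0.308]
new_cat = weights @ emb        # context-enriched "cat" vector
```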

Causal Masking (Decoder-Only)

In GPT-style models, the attention matrix is masked so position i can only attend to positions ≤ i. This is implemented by adding −∞ to the attention logits at future positions before softmax, making them zero-weight. This is what makes autoregressive generation possible: position 5 produces probabilities for token 6 without “seeing” tokens 6+.
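
The −∞ trick is a one-liner: exp(−∞) = 0, so masked positions receive exactly zero weight after softmax. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) attention logits. Mask future positions before softmax."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    masked = np.where(mask, -np.inf, scores)          # -inf -> zero weight
    e = np.exp(masked - masked.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# row i has nonzero weight only on positions <= i; each row still sums to 1
```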

Residual Connections + LayerNorm

Every sublayer output is added back to its input before normalization. Without residuals, gradients vanish through 6+ layers. LayerNorm normalizes each token’s representation to zero mean and unit variance, computed per-token across the embedding dimension (not across the batch as in BatchNorm). This stabilizes training regardless of sequence length or batch size.
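
The per-token axis choice is the whole difference from BatchNorm, and it is easy to see in code (illustrative, unparameterized LayerNorm without the usual learned gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # statistics per token, across the embedding dimension (axis -1);
    # BatchNorm would instead average across the batch axis
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(2).normal(size=(3, 8))   # 3 tokens, d_model = 8
y = layer_norm(x)   # each row now has mean ~0 and variance ~1
```

Because each token is normalized independently, the result is identical whether the batch holds one sequence or a thousand, which is what makes it robust to batch size and sequence length.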

Why It Replaced RNNs

  Property                     | RNN/LSTM                             | Transformer
  -----------------------------+--------------------------------------+---------------------------------------------
  Parallelism during training  | None — sequential per step           | Full — all positions simultaneously
  Long-range dependencies      | Degraded by O(n) path length         | O(1) path — any two tokens attend directly
  Max path length              | n (information travels step-by-step) | 1 (direct attention)
  GPU utilization              | ~10% (sequential bottleneck)         | ~60-80% (dense matrix ops)
  Memory scaling               | O(n)                                 | O(n²) attention matrix

The O(n²) memory cost is the Transformer’s original sin. It’s why FlashAttention, PagedAttention, and the entire efficient-attention subfield exist.

Scaling Behavior

The Transformer exhibits predictable power-law scaling: loss improves as ~N^(-0.076) with parameters and ~D^(-0.095) with training tokens. This regularity is what made GPT-3 (175B), PaLM (540B), and LLaMA possible — you could predict the outcome before running the experiment. No comparable scaling laws exist for RNNs, partly because training instabilities made reaching very large scales impractical.
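
To make the parameter exponent concrete, here is an illustrative loss-ratio calculation using only the ~0.076 figure quoted above (this is a toy prediction, not a fitted scaling law):

```python
def relative_loss(n_params, n_ref, alpha=0.076):
    """Predicted loss ratio when scaling parameters from n_ref to n_params:
    L(N) / L(N_ref) = (N_ref / N) ** alpha, with alpha ~ 0.076 as quoted above."""
    return (n_ref / n_params) ** alpha

# Doubling the parameter count predicts roughly a 5% relative loss reduction:
ratio = relative_loss(2e9, 1e9)   # ~0.949
```

Small per-doubling gains compound: the same formula predicts that a 100× scale-up cuts loss by about 30%, which is why the curves stay informative across many orders of magnitude.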

What’s Clever

The non-obvious move was the total elimination of recurrence. Attention had been used alongside RNNs since 2014 (Bahdanau et al.) — as a supplement to recurrence, not a replacement. The insight: recurrence was the bottleneck forcing sequential computation. Self-attention applied directly to token embeddings turned out to be sufficient. The paper’s title “Attention Is All You Need” was a provocation about this choice.

A common point of confusion: in the original paper, LayerNorm is applied after the residual add ("Post-LN"). Modern practice has switched to "Pre-LN" (normalize before the sublayer, inside the residual branch) because it trains more stably at large scale; this is what LLaMA and most modern LLMs use.
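
The two orderings differ only in where the normalization sits relative to the residual add (sketch; `f` is a stand-in for the attention or FFN sublayer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln(x, sublayer):
    return layer_norm(x + sublayer(x))   # original paper: LN after the residual add

def pre_ln(x, sublayer):
    return x + sublayer(layer_norm(x))   # modern: LN inside the residual branch

f = lambda t: t * 2.0                    # toy stand-in for attention or the FFN
x = np.random.default_rng(3).normal(size=(3, 8))
a, b = post_ln(x, f), pre_ln(x, f)       # same shapes, different normalization points
```

In Pre-LN the residual path itself is never normalized, so gradients flow through an identity skip from the loss straight to every layer, which is the usual explanation for its stability at depth.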

Key Sources

  • attention — the scaled dot-product and multi-head attention mechanism
  • positional-encoding — how position information is injected
  • flash-attention — IO-aware attention kernel that reduces memory from O(n²) to O(n)
  • kv-cache — caches K/V tensors at inference to avoid recomputation
  • scaling-laws — power-law relationships that make Transformer training predictable
  • ssm-mamba — linear-time alternative to attention for long sequences

Open Questions

  • Can attention be replaced entirely for long sequences? (See Mamba/SSMs — competitive but not yet dominant)
  • Optimal positional encoding: RoPE and ALiBi generalize better to longer contexts than sinusoidal
  • Whether sparse or linear attention variants can match dense attention at scale
  • Whether Pre-LN vs Post-LN matters beyond training stability (ongoing research)