The Problem
RNNs read a sentence like you’d read a foreign text you’ve never seen — one word at a time, building a running summary in your head. By the time you reach the end of a long sentence, the early words are compressed into a blurry residue of that summary. The model has no direct line back to word 1 when it’s processing word 50.
Attention solves this by refusing to summarize. Every token can look directly at every other token, paying whatever amount of attention it decides is useful.
Analogy
You’re in a meeting. Someone asks a question. You don’t re-read the entire meeting transcript — you scan back specifically for whoever spoke about that topic five minutes ago. Attention is that scan: structured, content-driven, direct.
The mechanism is exactly: “given what I’m trying to figure out (Query), who in this room knows something relevant (Key), and what did they actually say (Value)?”
Mechanism in Plain English
- For every token in the sequence, compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I actually say?).
- Compute similarity scores between the current token’s Query and every other token’s Key. High score = high relevance.
- Scale those scores by √d_k to stop them from getting so large that softmax saturates into a near one-hot distribution (which kills gradients).
- Apply softmax to turn scores into a probability distribution — the attention weights. They sum to 1.
- Take a weighted sum of all Value vectors using those weights. The result is the token’s new representation: a blend of what it “attended to.”
- Run this whole thing in parallel across independent “heads,” each projecting Q/K/V into its own subspace. Concatenate the outputs and project back.
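The steps above can be sketched end to end in a few lines of PyTorch. This is a minimal illustration, not a library API: the projection matrices `W_q`, `W_k`, `W_v` stand in for learned weights and are just random here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 6, 16, 8

x = torch.randn(seq_len, d_model)       # one embedding per token

# Step 1: project each token into Query, Key, Value
# (random matrices stand in for learned projections)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # (seq_len, d_k) each

# Steps 2-3: similarity scores, scaled by sqrt(d_k)
scores = Q @ K.T / d_k**0.5             # (seq_len, seq_len)

# Step 4: softmax along the key axis, so each row sums to 1
weights = F.softmax(scores, dim=-1)

# Step 5: weighted sum of Values = new representation per token
out = weights @ V                       # (seq_len, d_k)

print(out.shape)                        # torch.Size([6, 8])
```

Each row of `weights` is one token's attention distribution over the whole sequence; `out` replaces each token's vector with its attended blend.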
ASCII Diagram
Single Attention Head
Token sequence: [The] [cat] [sat] [on] [the] [mat]
↑
current token "sat"
Query(sat) ──→ score with Key(The) → 0.1
score with Key(cat) → 0.6 ← high: subject matters
score with Key(sat) → 0.1 (self)
score with Key(on) → 0.1
score with Key(the) → 0.05
score with Key(mat) → 0.05
↓ softmax
Attention weights: [0.153, 0.252, 0.153, 0.153, 0.145, 0.145]
          ↓ weighted sum of Values
Output(sat) = 0.252·V(cat) + 0.153·V(The) + 0.153·V(sat) + ...
"sat" now carries information about "cat" (its subject)
Math with Translation
- QKᵀ — matrix of dot products: how similar is each query to each key?
- QKᵀ / √d_k — scale down to prevent vanishing gradients in softmax (d_k is key dimension, e.g. 64)
- softmax(QKᵀ / √d_k) — normalize scores into weights that sum to 1
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V — weighted blend of value vectors
For multi-head:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O, where each head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Each head uses its own learned projections so it can specialize.
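The multi-head formula translates into a short module. A sketch under standard Transformer conventions; the class name and dimensions here are illustrative, not a specific library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # One big projection per role; split into heads afterwards
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection W^O

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        def split(t):                            # (b, s, d_model) -> (b, heads, s, d_k)
            return t.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5
        weights = F.softmax(scores, dim=-1)      # one attention map per head
        out = weights @ V                        # (b, heads, s, d_k)
        out = out.transpose(1, 2).reshape(b, s, -1)  # concatenate heads
        return self.W_o(out)

x = torch.randn(2, 6, 32)
y = MultiHeadAttention()(x)
print(y.shape)   # torch.Size([2, 6, 32])
```

Note that the heads never see each other until the final concatenation and output projection, which is what lets them specialize independently.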
Concrete Walkthrough
Sequence: [I] [love] [Paris], single head, d_k = 4.
Suppose for the token “love” we have (simplified, made up):
Q_love = [1, 0, 1, 0]
K_I = [1, 0, 0, 1]
K_love = [0, 1, 1, 0]
K_Paris = [1, 0, 1, 0]
Dot products (Q_love · Kᵀ):
- Q·K_I = 1·1 + 0·0 + 1·0 + 0·1 = 1
- Q·K_love = 1·0 + 0·1 + 1·1 + 0·0 = 1
- Q·K_Paris= 1·1 + 0·0 + 1·1 + 0·0 = 2
Scale by √d_k = 2: scores = [0.5, 0.5, 1.0]
Softmax([0.5, 0.5, 1.0]):
- e^0.5 ≈ 1.65, e^0.5 ≈ 1.65, e^1.0 ≈ 2.72, sum ≈ 6.02
- weights ≈ [0.27, 0.27, 0.45]
“love” attends most to “Paris”. Its output is 0.45·V_Paris + 0.27·V_I + 0.27·V_love.
This is how the word “love” ends up carrying information about the object it precedes.
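The walkthrough can be reproduced directly with the same made-up Q and K vectors (V is omitted since only the weights are being checked):

```python
import torch

Q_love = torch.tensor([1., 0., 1., 0.])
K = torch.stack([
    torch.tensor([1., 0., 0., 1.]),   # K_I
    torch.tensor([0., 1., 1., 0.]),   # K_love
    torch.tensor([1., 0., 1., 0.]),   # K_Paris
])

scores = K @ Q_love / K.size(-1)**0.5   # dot products scaled by sqrt(d_k) = 2
weights = torch.softmax(scores, dim=0)

print(scores)   # [0.5, 0.5, 1.0]
print(weights)  # ≈ [0.27, 0.27, 0.45]; "love" attends most to "Paris"
```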
What’s Clever
The non-obvious insight is learned routing with no fixed structure.
Before attention, the model had to decide in advance how far back to look (convolutions with fixed kernels) or compress everything into one state (RNNs). The assumption was that locality matters or that a single bottleneck is enough.
Attention relaxes both assumptions simultaneously. The model learns which positions matter for which queries, end-to-end. A token about pronoun resolution learns to attend to the antecedent regardless of distance. A token about verb agreement learns to attend to the subject.
The second clever thing: multi-head. By running attention in parallel subspaces, different heads naturally specialize — one tracks syntactic dependencies, another tracks coreference, another tracks positional proximity. This wasn’t designed in; it emerges from training.
Code
Minimal scaled dot-product attention in PyTorch:
```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)                             # key/query dimension, e.g. 64
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (seq, seq) similarity matrix, scaled
    weights = F.softmax(scores, dim=-1)          # normalize to sum=1 along key axis
    return weights @ V                           # weighted blend of value vectors

# Example: batch=1, 6 tokens, 1 head, d_k=d_v=8
Q = torch.randn(1, 6, 8)  # queries: what each token is looking for
K = torch.randn(1, 6, 8)  # keys: what each token advertises
V = torch.randn(1, 6, 8)  # values: what each token actually says
out = attention(Q, K, V)  # shape: (1, 6, 8) — one attended vector per token
```
Key Sources
- numina-counting-text-to-video — diagnoses counting failures via cross-attention map analysis in video diffusion
Related Concepts
Open Questions
- How attention patterns relate to reasoning and circuit-level interpretability
- Quadratic complexity in sequence length — active area for linear attention alternatives