The Problem
RNNs read a sentence like you’d read a foreign text you’ve never seen — one word at a time, building a running summary in your head. By the time you reach the end of a long sentence, the early words are compressed into a blurry residue of that summary. The model has no direct line back to word 1 when it’s processing word 50.
Attention solves this by refusing to summarize. Every token can look directly at every other token, paying whatever amount of attention it decides is useful.
Analogy
You’re in a meeting. Someone asks a question. You don’t re-read the entire meeting transcript — you scan back specifically for whoever spoke about that topic five minutes ago. Attention is that scan: structured, content-driven, direct.
The mechanism is exactly: “given what I’m trying to figure out (Query), who in this room knows something relevant (Key), and what did they actually say (Value)?”
Mechanism in Plain English
- For every token in the sequence, compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I actually say?).
- Compute similarity scores between the current token’s Query and every other token’s Key. High score = high relevance.
- Scale those scores by √d_k to stop them from getting so large that softmax saturates into a near one-hot distribution (which kills gradients).
- Apply softmax to turn scores into a probability distribution — the attention weights. They sum to 1.
- Take a weighted sum of all Value vectors using those weights. The result is the token’s new representation: a blend of what it “attended to.”
- Run this whole thing in parallel across independent “heads,” each projecting Q/K/V into its own subspace. Concatenate the outputs and project back.
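The steps above can be sketched end to end in a few lines of PyTorch. This is a minimal illustration, not a library API: the projection matrices `W_q`, `W_k`, `W_v` stand in for learned weights and are just random here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 6, 16, 8

x = torch.randn(seq_len, d_model)       # one embedding per token

# Step 1: project each token into Query, Key, Value
# (random matrices stand in for learned projections)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # (seq_len, d_k) each

# Steps 2-3: similarity scores, scaled by sqrt(d_k)
scores = Q @ K.T / d_k**0.5             # (seq_len, seq_len)

# Step 4: softmax along the key axis, so each row sums to 1
weights = F.softmax(scores, dim=-1)

# Step 5: weighted sum of Values = new representation per token
out = weights @ V                       # (seq_len, d_k)

print(out.shape)                        # torch.Size([6, 8])
```

Each row of `weights` is one token's attention distribution over the whole sequence; `out` replaces each token's vector with its attended blend.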
ASCII Diagram
Single Attention Head
Token sequence: [The] [cat] [sat] [on] [the] [mat]
↑
current token "sat"
Query(sat) ──→ score with Key(The) → 0.1
score with Key(cat) → 0.6 ← high: subject matters
score with Key(sat) → 0.1 (self)
score with Key(on) → 0.1
score with Key(the) → 0.05
score with Key(mat) → 0.05
↓ softmax
Attention weights: [0.153, 0.252, 0.153, 0.153, 0.145, 0.145]
          ↓ weighted sum of Values
Output(sat) = 0.252·V(cat) + 0.153·V(The) + 0.153·V(sat) + ...
"sat" now carries information about "cat" (its subject)
Math with Translation
- QKᵀ — matrix of dot products: how similar is each query to each key?
- QKᵀ / √d_k — scale down to prevent vanishing gradients in softmax (d_k is key dimension, e.g. 64)
- softmax(QKᵀ / √d_k) — normalize scores into weights that sum to 1
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V — weighted blend of value vectors
For multi-head:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O, where each head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Each head uses its own learned projections so it can specialize.
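The multi-head formula translates into a short module. A sketch under standard Transformer conventions; the class name and dimensions here are illustrative, not a specific library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # One big projection per role; split into heads afterwards
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection W^O

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        def split(t):                            # (b, s, d_model) -> (b, heads, s, d_k)
            return t.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5
        weights = F.softmax(scores, dim=-1)      # one attention map per head
        out = weights @ V                        # (b, heads, s, d_k)
        out = out.transpose(1, 2).reshape(b, s, -1)  # concatenate heads
        return self.W_o(out)

x = torch.randn(2, 6, 32)
y = MultiHeadAttention()(x)
print(y.shape)   # torch.Size([2, 6, 32])
```

Note that the heads never see each other until the final concatenation and output projection, which is what lets them specialize independently.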
Concrete Walkthrough
Sequence: [I] [love] [Paris], single head, d_k = 4.
Suppose for the token “love” we have (simplified, made up):
Q_love = [1, 0, 1, 0]
K_I = [1, 0, 0, 1]
K_love = [0, 1, 1, 0]
K_Paris = [1, 0, 1, 0]
Dot products (Q_love · Kᵀ):
- Q·K_I = 1·1 + 0·0 + 1·0 + 0·1 = 1
- Q·K_love = 1·0 + 0·1 + 1·1 + 0·0 = 1
- Q·K_Paris= 1·1 + 0·0 + 1·1 + 0·0 = 2
Scale by √d_k = 2: scores = [0.5, 0.5, 1.0]
Softmax([0.5, 0.5, 1.0]):
- e^0.5 ≈ 1.65, e^0.5 ≈ 1.65, e^1.0 ≈ 2.72, sum ≈ 6.02
- weights ≈ [0.27, 0.27, 0.45]
“love” attends most to “Paris”. Its output is 0.45·V_Paris + 0.27·V_I + 0.27·V_love.
This is how the word “love” ends up carrying information about the object it precedes.
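The walkthrough can be reproduced directly with the same made-up Q and K vectors (V is omitted since only the weights are being checked):

```python
import torch

Q_love = torch.tensor([1., 0., 1., 0.])
K = torch.stack([
    torch.tensor([1., 0., 0., 1.]),   # K_I
    torch.tensor([0., 1., 1., 0.]),   # K_love
    torch.tensor([1., 0., 1., 0.]),   # K_Paris
])

scores = K @ Q_love / K.size(-1)**0.5   # dot products scaled by sqrt(d_k) = 2
weights = torch.softmax(scores, dim=0)

print(scores)   # [0.5, 0.5, 1.0]
print(weights)  # ≈ [0.27, 0.27, 0.45]; "love" attends most to "Paris"
```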
What’s Clever
The non-obvious insight is learned routing with no fixed structure.
Before attention, the model had to decide in advance how far back to look (convolutions with fixed kernels) or compress everything into one state (RNNs). The assumption was that locality matters or that a single bottleneck is enough.
Attention relaxes both assumptions simultaneously. The model learns which positions matter for which queries, end-to-end. A token about pronoun resolution learns to attend to the antecedent regardless of distance. A token about verb agreement learns to attend to the subject.
The second clever thing: multi-head. By running attention in parallel subspaces, different heads naturally specialize — one tracks syntactic dependencies, another tracks coreference, another tracks positional proximity. This wasn’t designed in; it emerges from training.
Code
Minimal scaled dot-product attention in PyTorch:
```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)                             # key/query dimension, e.g. 64
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (seq, seq) similarity matrix, scaled
    weights = F.softmax(scores, dim=-1)          # normalize to sum=1 along key axis
    return weights @ V                           # weighted blend of value vectors

# Example: batch=1, 6 tokens, 1 head, d_k=d_v=8
Q = torch.randn(1, 6, 8)  # queries: what each token is looking for
K = torch.randn(1, 6, 8)  # keys: what each token advertises
V = torch.randn(1, 6, 8)  # values: what each token actually says
out = attention(Q, K, V)  # shape: (1, 6, 8) — one attended vector per token
```
Key Sources
- numina-counting-text-to-video — diagnoses counting failures via cross-attention map analysis in video diffusion
Related Concepts
Open Questions
- How attention patterns relate to reasoning and circuit-level interpretability
- Quadratic complexity in sequence length — active area for linear attention alternatives