Every language model you’ve used in the last eight years runs on an architecture described in a 15-page paper published in 2017 by a team at Google. The paper was titled “Attention Is All You Need.” The title was a provocation — it claimed that you could throw away the dominant paradigm of sequence modeling (recurrent networks, reading word by word) and replace it entirely with one mechanism. The claim turned out to be correct in a way that even the authors probably didn’t expect.

The core idea

The analogy: You’re taking an open-book exam. The question is “who invented the telephone?” You don’t re-read the entire textbook from page 1. You scan for relevant keywords — “telephone,” “inventor,” “Bell” — jump directly to the relevant section, read those paragraphs, and answer. You don’t have to process every piece of information in order; you can access any part of the “input” directly based on relevance to the current question.

Recurrent neural networks (RNNs), the dominant sequence model before 2017, did the opposite. They read text exactly like a person with severe amnesia: word by word, left to right, compressing everything into a single fixed-size “memory vector.” By the time you’d read 100 words, your memory of word #1 had been diluted through 99 transformations. Long-range dependencies were nearly impossible to learn.

The second problem was speed. RNNs are inherently sequential: you can’t compute the hidden state for word 5 until you’ve computed word 4. On GPUs — which are built for parallel computation — this was catastrophic. Training that should take hours took weeks.

The Transformer’s bet: what if instead of processing words sequentially, we let every word attend directly to every other word simultaneously? The model sees the whole sentence at once, computes pairwise relevance scores between all word pairs, and builds up context-enriched representations in one shot.

The mechanism, step by step:

  1. Convert each input token into a vector (an embedding) of size 512.
  2. Add position information (since we’re not reading sequentially, we inject position signals directly into the embeddings).
  3. For each token, compute three things from its embedding: a Query (what am I looking for?), a Key (what do I offer to other tokens?), and a Value (what information do I actually pass along?).
  4. For each token’s Query, compute a dot product against every other token’s Key. This gives a raw relevance score.
  5. Divide those scores by √d_k (√64 = 8 in the base model, since each head works with 64-dimensional Q/K vectors) so the numbers stay in a range where the softmax behaves well.
  6. Softmax the scores to get attention weights that sum to 1.
  7. Take a weighted sum of all tokens’ Values using those weights.
  8. That weighted sum is the token’s new representation — enriched by context from the whole sequence.
  9. Repeat this process in parallel across 8 “heads” (8 independent sets of Q/K/V matrices), each learning different kinds of relationships.
  10. Concatenate all 8 heads’ outputs, project back to size 512.
  11. Pass through a position-wise feedforward network (two linear layers with ReLU, expanding to size 2048 then contracting back to 512).
  12. Stack 6 of these encoder layers. Each layer further refines the representations.

In diagram form:

INPUT: "The cat sat on the mat"
         |
         v
[Embedding: 512-dim vectors for each token]
         +
[Positional Encoding: sine/cosine signals injected]
         |
         v
+--------+--------+--------+--------+--------+--------+
|     ENCODER LAYER (x6 stacked)                      |
|                                                      |
|  Each token gets Q, K, V vectors (dim 64 each)      |
|                                                      |
|  "The" queries every token's Key:                    |
|   The↔The  The↔cat  The↔sat  The↔on  The↔the  ...   |
|   [0.1]    [0.6]    [0.05]  [0.1]   [0.15]   ...   |
|                 ↓ (after softmax)                    |
|   weighted sum of all Value vectors                  |
|   = new rich representation of "The"                |
|                                                      |
|  All tokens computed IN PARALLEL (not sequential)    |
|                                                      |
|  → Add residual (original + attention output)        |
|  → LayerNorm                                         |
|  → Feedforward (expand to 2048, back to 512)         |
|  → Add residual, LayerNorm                           |
+------------------------------------------------------+
         |
         v
      Context-enriched representations for all tokens
         |
         v
[DECODER: cross-attends to encoder output, generates output one token at a time]
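The encoder layer in the diagram can be sketched in a few lines of NumPy. This is a single-head toy with made-up parameter names (Wq, W1, and so on), not the paper's 8-head, 512-dimensional configuration, and it omits dropout and LayerNorm's learned gain/bias:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, p):
    """x: (n_tokens, d_model). p: dict of toy weight matrices."""
    # Self-attention sublayer: project to Q/K/V, score, scale, softmax, blend.
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ p["Wo"])          # residual + LayerNorm
    # Position-wise feedforward: expand, ReLU, contract.
    ff = np.maximum(0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ff)                   # residual + LayerNorm again
```

Stacking six of these (each with its own weights) gives the encoder.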

The math, translated:

The core formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

  • Q — the matrix of query vectors, one per token. Each query is asking: “what context do I need to understand myself?”
  • K — the matrix of key vectors. Each key advertises: “here’s what I can offer for comparison.”
  • QKᵀ — dot product of every query against every key. This is O(n²) — the quadratic cost. A 1,000-token sequence means 1,000,000 comparisons.
  • √d_k — scaling by the square root of the key dimension (64 in the base model). Without this, when d_k is large, dot products grow large in magnitude and push the softmax into regions with near-zero gradients, killing learning.
  • softmax(...) — converts raw scores to probabilities that sum to 1 per query. High-scoring tokens get most of the “attention budget.”
  • × V — the weighted sum. You’re grabbing a blend of Value vectors proportional to how relevant each token is.
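The formula transcribes almost line for line into NumPy (a sketch of the math, not the paper's actual code):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, for matrices of row vectors."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # n x n matrix: the O(n^2) part
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)        # softmax: each row sums to 1
    return weights @ V                            # blend of Value vectors per query
```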

For multi-head attention:

MultiHead(Q,K,V) = Concat(head₁, …, head₈) × W_O

where each headᵢ = Attention(Q×Wᵢ_Q, K×Wᵢ_K, V×Wᵢ_V)

Translation: run 8 independent attention computations with different learned projection matrices. Each head can learn to look for a different kind of relationship (syntactic subject-verb, semantic coreference, positional proximity, etc.). Concatenate all 8, then project back down to size 512.
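In code, multi-head attention is just the single-head computation repeated with different projections and concatenated (a sketch; the projection triples and W_O are learned in the real model, random here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head(x, heads, Wo):
    """heads: list of (Wq, Wk, Wv) projection triples, one per head."""
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    return np.concatenate(outs, axis=-1) @ Wo   # concat all heads, project back
```

In the base model, d_model=512 splits into 8 heads of d_k=64, so the concatenation lands back at 512 before the W_O projection.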

Walkthrough with actual numbers:

Trace the attention computation for a 3-token sequence: [“cat”, “sat”, “on”] with d_model=4 and d_k=4 (simplified from the real 512/64). To keep the arithmetic readable, we also skip the learned projection matrices and use the raw embeddings directly as Q, K, and V.

Token embeddings:
  cat = [0.9, 0.2, 0.1, 0.8]
  sat = [0.3, 0.7, 0.6, 0.1]
  on  = [0.5, 0.4, 0.8, 0.3]

Step 1: Compute QKᵀ (dot products between every pair)
  cat·cat = 0.9×0.9 + 0.2×0.2 + 0.1×0.1 + 0.8×0.8 = 1.50
  cat·sat = 0.9×0.3 + 0.2×0.7 + 0.1×0.6 + 0.8×0.1 = 0.55
  cat·on  = 0.9×0.5 + 0.2×0.4 + 0.1×0.8 + 0.8×0.3 = 0.85

  QKᵀ row for "cat": [1.50, 0.55, 0.85]

Step 2: Scale by 1/√d_k = 0.5
  Scaled: [0.75, 0.275, 0.425]

Step 3: Softmax
  Attention weights for "cat": [0.427, 0.265, 0.308]
  (cat attends most to itself: 42.7%, "on": 30.8%, "sat": 26.5%)

Step 4: Weighted sum of Values
  new_cat = 0.427 × [0.9, 0.2, 0.1, 0.8]
          + 0.265 × [0.3, 0.7, 0.6, 0.1]
          + 0.308 × [0.5, 0.4, 0.8, 0.3]
  new_cat = [0.618, 0.394, 0.448, 0.460]

This is no longer just "cat" — it's "cat in the context of what it sits near."
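The whole walkthrough fits in a few lines of NumPy (as in the text above, Q = K = V = the raw embeddings, with no learned projections):

```python
import numpy as np

x = np.array([[0.9, 0.2, 0.1, 0.8],    # cat
              [0.3, 0.7, 0.6, 0.1],    # sat
              [0.5, 0.4, 0.8, 0.3]])   # on

scores = x @ x.T / np.sqrt(4)                       # QK^T, scaled by 1/sqrt(d_k)
e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)              # softmax per row
new_x = weights @ x                                 # weighted sum of Values

print(np.round(weights[0], 3))   # attention weights for "cat"
print(np.round(new_x[0], 3))     # context-enriched "cat"
```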

What’s clever:

The non-obvious move was the total elimination of recurrence. Attention mechanisms had been used alongside RNNs since 2014 (Bahdanau et al.). The standard thinking was: use an RNN to build up hidden states, then use attention to selectively focus those states. Attention was a supplement to recurrence, not a replacement.

The insight was: the recurrent part was actually the bottleneck. It was what forced sequential processing. What if attention, applied directly to the raw token embeddings, was sufficient on its own? The paper tests this — and it is.

“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”

But this creates a problem: without sequential processing, you lose position information. The fix is positional encoding: inject position signals directly into the embeddings before the first attention layer. The paper uses sinusoids:

“We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).”
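The sinusoidal encoding sketched in NumPy (interleaving sine and cosine across the embedding dimensions, per the paper's definition):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]             # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe
```

Each dimension oscillates at a different wavelength; the linear-offset property the authors mention follows from the angle-addition identities for sine and cosine.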

The second non-obvious move was multi-head attention. A single attention head averages over all the relationships it finds. Eight heads, each with its own independent Q/K/V matrices, learn to specialize: one might track syntactic structure, another coreference, another positional proximity.

“Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”

Why the scaling factor:

“We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.”

Translation: without the √d_k scaling, the softmax saturates — it gives ~1.0 to one token and ~0.0 to everything else. The model can’t learn nuanced attention patterns; it just learns to hard-select one token.
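A quick experiment shows the saturation (toy numbers, not from the paper; the effect, not the exact values, is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
keys = rng.standard_normal((10, d_k))
scores = keys @ q                   # raw dot products: variance grows with d_k

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

unscaled = softmax(scores)
scaled = softmax(scores / np.sqrt(d_k))
# Unscaled weights collapse toward a single token; scaled weights stay spread out.
print(unscaled.max(), scaled.max())
```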

Does it work? What breaks?

| Model | Task | Score | vs. Previous Best | Training Cost |
|---|---|---|---|---|
| Transformer (big) | WMT 2014 EN→DE | 28.4 BLEU | +2.0 over all ensembles | 3.5 days, 8 GPUs |
| Transformer (big) | WMT 2014 EN→FR | 41.8 BLEU | SOTA, single model | 3.5 days, 8 GPUs |
| Transformer (4-layer) | English constituency parsing | 92.7 F1 | Better than most semi-supervised models | |

The 28.4 BLEU on English→German is +2.0 over the previous best, which was an ensemble of multiple models. The Transformer, as a single model, beat multi-model ensembles and trained at a fraction of the compute: 2.3×10¹⁹ FLOPs vs. 1.4×10²⁰ FLOPs for the previous best.

What doesn’t work:

The O(n²) attention cost is the original sin. A 1,000-token sequence requires 1,000,000 attention computations. A 100,000-token sequence requires 10 billion. This is why context windows were limited to 512 or 1,024 tokens for years — the memory requirement grows quadratically. The entire subfield of “efficient attention” (Longformer, BigBird, FlashAttention, linear attention) exists to solve the problem this paper created.
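Back-of-envelope arithmetic makes the quadratic wall concrete (float32 score matrices only, ignoring activations, gradients, and everything else):

```python
def attn_matrix_gb(n_tokens, n_heads=8, n_layers=6, bytes_per_float=4):
    # One n x n attention-score matrix per head per layer.
    return n_tokens**2 * n_heads * n_layers * bytes_per_float / 1e9

for n in (512, 1024, 8192, 100_000):
    print(f"{n:>7} tokens -> {attn_matrix_gb(n):,.2f} GB of scores")
# At 100,000 tokens, the scores alone come to ~1,920 GB.
```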

The paper is also coy about what “attention” is actually learning. Later interpretability research found that attention patterns are a poor proxy for what information actually flows through the model. What the model attends to and what it uses are different things.

So what?

If you’re building anything with language models today, you are using the Transformer. LoRA adapts the attention projection matrices this paper introduced (typically W_q and W_v). Chain-of-thought prompting works because of the representational power this architecture enables at scale. When you read any paper that says “we use a standard Transformer architecture,” they mean this paper.

The Transformer’s bet was that parallelization through attention was worth the O(n²) cost. The bet paid off because typical sequence lengths n were smaller than the model dimension d, so per-layer self-attention (roughly n²·d operations) was actually cheaper than recurrence (roughly n·d²). As soon as you have fast training, you can scale. As soon as you can scale, you find the scaling laws that predict performance. The scaling laws led to GPT-3, which led to instruction tuning, which led to ChatGPT. The whole chain traces back to this paper’s willingness to drop recurrence entirely.

The Transformer didn’t just solve machine translation — it gave every AI researcher a universal, parallelizable, scalable building block that turned out to work for text, images, audio, protein sequences, and video. “Attention is all you need” was a brag about translation. It turned out to be a statement about architecture universality.

Connections

  • transformer — introduces the Transformer architecture
  • attention — defines multi-head and scaled dot-product attention
  • lora — LoRA adapts the Q and V matrices introduced here
  • flash-attention — optimizes the attention computation from this paper
  • scaling-laws — scaling laws built on this architecture enabled GPT-3 and beyond

Citation

arXiv:1706.03762

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762