Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Concepts: positional-encoding | attention | long-context | inductive-bias
Builds on: attention-is-all-you-need | rope-rotary-position-embedding
Leads to: (context window extension research: YaRN, LongRoPE, Longformer)

The problem

You train an LLM on 1024-token sequences. At deployment, users want to send 2048-token prompts. With standard positional encodings, positions 1025–2048 were never seen during training: learned embeddings have no vector for them at all, and sinusoidal encodings produce vectors the model was never trained to interpret. Either way the inputs are out of distribution and output quality collapses. Scaling training length is expensive: you need longer sequences in the batch, gradient checkpointing, more memory. The goal is to train short and deploy long. Sinusoidal encodings and learned embeddings both fail this test.

The core idea

The analogy. You’re a librarian who has memorized a rule: “recent checkout history matters more than old checkout history.” You don’t memorize “this patron is customer number 4,391 and returned book #17 on March 5th.” You just know: recent activity is weighted more heavily, and the older it is, the less it matters. That rule works whether you’re looking at 30 days of history or 300 days — you never had to “see” a 300-day window to apply it correctly. Your recency rule is distance-based, not position-based.

That’s ALiBi. Instead of encoding “I am at absolute position 47,” the model learns a simpler signal: “this token is 3 steps behind the current query, so penalize its attention score slightly; this token is 40 steps behind, penalize more.” The penalty grows with distance. At test time, the formula handles any distance naturally — because it was never anchored to a specific maximum position in the first place.

The mechanism.

Standard attention computes: $\mathrm{softmax}\left(\frac{q_{i} K^{T}}{\sqrt{d}}\right) \cdot V$

ALiBi removes positional embeddings from the word embeddings entirely and instead adds a fixed bias to the pre-softmax attention scores:

$\mathrm{softmax}\left(\frac{q_{i} K^{T}}{\sqrt{d}} - m_{h} \cdot [i - 0, i - 1, \dots, i - i]\right) \cdot V$

The bias vector for query at position $i$ across all key positions $j \leq i$ (causal attention) is: $bias [j] = - m_{h} \cdot (i - j)$

where $m_{h}$ is a small, fixed, non-learned slope specific to head $h$ . The bias is always ≤ 0. The further back a key token sits, the larger the penalty, the less attention it receives.
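The bias definition above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name `alibi_bias_row` is hypothetical, and it assumes causal attention with 0-indexed positions as in the formula.

```python
def alibi_bias_row(i, slope):
    """Biases for a query at position i against keys 0..i (causal attention)."""
    # bias[j] = -slope * (i - j), written as slope * (j - i):
    # zero at the query's own position, more negative for keys further back.
    return [slope * (j - i) for j in range(i + 1)]


print(alibi_bias_row(3, 0.5))  # [-1.5, -1.0, -0.5, 0.0]
```

Nothing in the function depends on a maximum sequence length, which is the property the rest of the argument relies on.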

The slopes $m_{h}$ form a geometric sequence. For $H$ heads, the slopes are: $m_{h} = 2^{- 8 h / H}, h = 1, \dots, H$

For 8 heads: slopes = $2^{- 1}, 2^{- 2}, \dots, 2^{- 8}$ = $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \dots, \frac{1}{256}$ .
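As a quick check of the slope formula, here is a small sketch that evaluates it directly. The helper name `alibi_slopes` is an assumption for illustration, and the code follows only the formula stated above; real implementations may treat head counts that are not powers of two differently.

```python
def alibi_slopes(num_heads):
    """Slopes m_h = 2^(-8h/H) for h = 1..H, per the formula in the text."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])  # 0.5 0.00390625
```

With 16 heads the same formula starts at $2^{-0.5}$ and still ends at $2^{-8}$, so more heads means a finer geometric grid over the same range.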

Each head has a different “focal length.” Head 1 (slope = 1/2) heavily penalizes distance — it focuses on nearby tokens. Head 8 (slope = 1/256) barely penalizes at all — it still sees far-away tokens. Together they cover multiple scales of context.

“ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance.”

Translation: position is encoded as a subtraction from the attention score, not an addition to the word vector. It lives in the attention computation, not the embedding.
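To make "position lives in the attention computation" concrete, here is a sketch of a single head computing attention for one query, in plain Python. It assumes the usual $\sqrt{d}$ scaling and places the query at the last key position; `alibi_attention` and the toy inputs are hypothetical, chosen only for illustration.

```python
import math


def alibi_attention(q, keys, values, slope):
    """Attention output and weights for one query at the last key position."""
    i = len(keys) - 1  # query position; every key j <= i is visible (causal)
    d = len(q)
    scores = []
    for j, k in enumerate(keys):
        content = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
        scores.append(content + slope * (j - i))  # subtract slope * distance
    # Numerically stable softmax over the biased scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(values[0]))]
    return out, weights


# Identical keys give identical content scores, so only distance decides.
q = [1.0, 0.0]
keys = [[1.0, 0.0]] * 4
values = [[float(j)] for j in range(4)]
out, weights = alibi_attention(q, keys, values, slope=0.5)
print([round(w, 3) for w in weights])
```

No position vector is added to `q`, `keys`, or `values`; the only positional signal is the `slope * (j - i)` term.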

STANDARD SINUSOIDAL: position stamped into embedding, then forgotten over 24 layers

Token:     [The]     [cat]     [sat]     [on]     [mat]
Embedding: [0.3, ...] + [pos_0]    (absolute address baked in)
           position > 1023 at inference = out-of-distribution → garbage

────────────────────────────────────────────────────────────────

ALIBI: no position in embedding. Penalty added to attention scores.

Query position i=4 ("mat"), keys j=0..4, slope m=0.25:

  Key (j):   0       1       2       3       4
  Distance:  4       3       2       1       0
  Bias:     -1.0    -0.75   -0.5    -0.25    0.0

  Attention score = raw_QK_score + bias
  Softmax over these: nearby tokens win, distant tokens suppressed.

At test time, query position i=2000 (never seen in training):
  Key (j):   1996    1997    1998    1999    2000
  Distance:  4       3       2       1       0
  Bias:     -1.0    -0.75   -0.5    -0.25    0.0

  Identical bias pattern. Same attention behavior. No OOD inputs.
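The comparison in the diagram can be reproduced directly. This is a small sketch with a hypothetical helper, `recent_biases`, that returns the biases for the keys nearest a given query position.

```python
def recent_biases(i, slope, window=5):
    """Biases for the `window` keys closest to query position i, farthest first."""
    start = max(0, i - window + 1)
    return [slope * (j - i) for j in range(start, i + 1)]


print(recent_biases(4, 0.25))     # [-1.0, -0.75, -0.5, -0.25, 0.0]
print(recent_biases(2000, 0.25))  # [-1.0, -0.75, -0.5, -0.25, 0.0]
```

The two rows match because the bias depends only on the difference `j - i`, never on `i` itself.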

Walkthrough with real numbers. Training length = 4 tokens. Inference on 7 tokens (unseen positions 4–6). Head slope $m = 0.25$ .

Query is at position 6. Keys are at positions 0–6.

Step 1 — Compute biases:

j:    0      1      2      3      4      5      6
dist: 6      5      4      3      2      1      0
bias: -1.50  -1.25  -1.00  -0.75  -0.50  -0.25  0.00

Step 2 — Suppose raw $q_{6} K^{T} / \sqrt{d}$ scores are all 1.0 (uniform content relevance for simplicity):

pre-softmax: [1.0-1.50, 1.0-1.25, 1.0-1.00, 1.0-0.75, 1.0-0.50, 1.0-0.25, 1.0-0.00]
           = [-0.50,   -0.25,    0.00,    0.25,    0.50,    0.75,    1.00]

Step 3 — Softmax (computing $e^{x} / \sum e^{x}$ ):

exp vals: 0.607  0.779  1.000  1.284  1.649  2.117  2.718
sum:      10.153
weights:  0.060  0.077  0.098  0.126  0.162  0.209  0.268

The most recent token gets 26.8% of attention; the farthest gets 6.0%.
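The three steps above can be replayed in code. This sketch uses only the standard library; the variable names are illustrative, and the uniform raw score of 1.0 is the same simplifying assumption as in Step 2.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


slope = 0.25
query_pos = 6
raw_scores = [1.0] * (query_pos + 1)  # uniform content relevance
biased = [s + slope * (j - query_pos) for j, s in enumerate(raw_scores)]
weights = softmax(biased)

print(biased)                          # pre-softmax scores from Step 2
print([round(w, 3) for w in weights])  # attention weights from Step 3
```

Changing `query_pos` to any other value reuses the same code path, which is the point: nothing here is tied to the training length.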

Now compare position 3 (training-time query, max position seen):

biases: [-0.75, -0.50, -0.25, 0.00] → pre-softmax: [0.25, 0.50, 0.75, 1.00]
weights: 0.165, 0.212, 0.273, 0.350 (same shape, just over 4 tokens instead of 7)

The relative decay pattern is identical. The bias itself is never learned: it is a fixed function of distance, so whatever the model learned to do with it at training time transfers unchanged to position 6. The formula is defined everywhere; it just extends.
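Why the shape matches can be made explicit with a short derivation. It assumes, as in the walkthrough, that every key has the same content score $s$; the symbol $w_{j}$ for the attention weight on key $j$ is introduced here for illustration.

```latex
w_j = \frac{\exp\big(s - m\,(i - j)\big)}{\sum_{k=0}^{i} \exp\big(s - m\,(i - k)\big)}
\quad\Longrightarrow\quad
\frac{w_{j+1}}{w_j} = e^{m},
\qquad e^{0.25} \approx 1.284
```

The ratio between neighbouring weights does not depend on the query position $i$, so a query at position 3 and a query at position 6 see the same decay curve, only normalized over a different number of keys.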

What’s clever — find the instinct.

Here’s the reasoning path. Every prior position method encoded position as a feature: a vector you add to the embedding or multiply into the query/key. Adding a feature requires knowing what feature values represent unseen positions — and there’s no obvious answer.

The instinct: what if position isn’t a feature but a constraint? Attention already lets a model decide how much to attend to any token. What if you just said “attention to distant tokens is more expensive,” not as learned behavior but as a hard architectural prior? Then position is a penalty, not a learned parameter, and the penalty formula trivially extends to any distance.

The secondary insight: the slopes don’t need to be trained. A geometric sequence covers the range from “very local” to “global” efficiently. The model learns to extract meaning from the pattern; the pattern itself is fixed.

“We believe that ALiBi works by providing a relative position encoding in the attention layer in a simple and efficient manner.”

Translation: the bias gives the model relative distance information at every attention computation, in every layer — not once at the input like sinusoidal, not re-encoded at each layer via rotation like RoPE. It’s the same distance signal, available everywhere, at zero parameter cost.

Why is it 11% faster with 11% less memory? The savings do not come from ALiBi’s bias arithmetic being cheaper than computing sinusoidal position vectors; either is a small cost. They come from sequence length: the sinusoidal baseline has to train at length 2048 to evaluate well at 2048, while ALiBi trains at length 1024 and extrapolates. Shorter training sequences mean smaller attention matrices and smaller activations. The 11% savings on both time and memory come from training on half the sequence length while achieving equivalent test perplexity.
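A rough way to see why sequence length matters is to count the query-key pairs that causal attention has to score. The sketch below is back-of-the-envelope arithmetic, not a profile of the paper's setup, and `causal_score_entries` is a hypothetical helper. It covers only the attention scores; the rest of the model costs about the same per token at either length, which is presumably why the overall saving is far smaller than 2×.

```python
def causal_score_entries(seq_len):
    """Query-key pairs one head scores for one causal sequence."""
    return seq_len * (seq_len + 1) // 2


short_len, long_len = 1024, 2048
per_token_short = causal_score_entries(short_len) / short_len
per_token_long = causal_score_entries(long_len) / long_len
print(per_token_short, per_token_long)  # 512.5 1024.5
```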

Does it work?

Setting	Baseline	ALiBi	Training cost vs. baseline
1.3B param model, eval at L=2048	Sinusoidal trained on L=2048: baseline PPL	ALiBi trained on L=1024: same PPL as baseline	11% faster, 11% less memory
WikiText-103, eval at training length	Sinusoidal (L=512): 18.67 PPL	ALiBi (L=512): 18.27 PPL (better in-domain too)	Same compute
WikiText-103, extrapolating to 3× training length	Sinusoidal: perplexity explodes	ALiBi: perplexity stays flat or improves	—

The in-domain result is notable: ALiBi isn’t just a “coping mechanism” for extrapolation. It beats sinusoidal even at the lengths it was trained on. The recency bias is a useful inductive prior, not a compromise.

What doesn’t work. ALiBi’s recency bias is a fixed inductive prior — it always penalizes distant attention. Tasks where long-range dependencies matter more than local context may suffer. Document-level coreference resolution, cross-document reasoning, tasks where the most important token is near the beginning: these could see accuracy drops because the model is architecturally discouraged from attending far back.

The slopes are also fixed heuristics (geometric sequence). The paper validates them empirically on language modeling, but optimal slopes for code, math, or structured generation could differ. There’s no learned slope mechanism in the base paper.

Finally, ALiBi’s extrapolation has limits: it doesn’t extend usefully to arbitrarily long sequences. Empirically, strong extrapolation typically holds to about 2–3× the training length. At 10× training length, the biases on the most distant keys are far larger than anything seen during training, and attention patterns can degrade because the model has never been trained to recover from bias values that large.
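To get a feel for the scale of those biases, the sketch below computes how much softmax weight a distant key gets relative to a key at distance 0, assuming both have the same content score. That assumption, the 1024-token training length, and the helper name `relative_weight` are illustrative choices; a trained model can partly offset the bias with larger content scores.

```python
import math


def relative_weight(slope, distance):
    """Weight of a key at `distance` relative to one at distance 0, equal content."""
    return math.exp(-slope * distance)


slopes = [2.0 ** -k for k in range(1, 9)]  # the 8-head slopes from the text
train_len = 1024
for dist in (train_len, 3 * train_len, 10 * train_len):
    steepest = relative_weight(slopes[0], dist)
    flattest = relative_weight(slopes[-1], dist)
    print(dist, f"{steepest:.1e}", f"{flattest:.1e}")
```

Even the flattest head (slope 1/256) applies a bias of -40 at a distance of 10,240 tokens, so under these assumptions such keys are effectively invisible.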

So what?

If you’re building an LLM and want controlled memory at deploy time, ALiBi gives you a concrete tradeoff: train on L tokens, reliably deploy at up to ~2–3L. The 11% training speedup and memory savings are real, not marginal. For production systems where GPU cost is the constraint and users need modest context extensions (1K training → 2K inference), ALiBi is among the simplest methods, with predictable empirical behavior rather than formal guarantees. MPT and BLOOM-176B both adopted it, getting extrapolation without any fine-tuning infrastructure.

Recall how attention-is-all-you-need injected positional information by adding sinusoidal vectors to embeddings at the input — a “tattoo” that fades over 24 layers. Then rope-rotary-position-embedding moved the position signal into the dot product itself, using rotation to preserve relative distance algebraically. ALiBi takes the third path: forget the position signal in the embedding entirely, and encode distance directly as an attention penalty. Each approach is trading off expressiveness (learned, arbitrary position dependencies) against extrapolation (works at any length). ALiBi makes the most aggressive bet on recency-as-prior, and for language modeling at moderate lengths, that bet pays off.

ALiBi: remove the position tattoo, add a distance penalty, and the model learns to extrapolate for free.

Connections

positional-encoding — ALiBi is a non-parametric alternative to sinusoidal, learned, and rotary position encodings
attention — the bias is added to pre-softmax attention scores; no change to V projections or the rest of the architecture
long-context — ALiBi’s recency inductive bias enables length extrapolation at inference
inductive-bias — the geometric slope sequence encodes a fixed recency prior directly into the architecture
attention-is-all-you-need — sinusoidal PE is the baseline ALiBi supersedes for extrapolation
rope-rotary-position-embedding — RoPE is the concurrent approach that also targets relative position; extrapolation failure modes differ

Citation

arXiv:2108.12409

Press, O., Smith, N. A., & Lewis, M. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR 2022. https://arxiv.org/abs/2108.12409
