What It Is
Positional encodings inject information about token order into the Transformer, which is otherwise permutation-equivariant (pure attention has no notion of position).
Why It Matters
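Without positional information, an unmasked Transformer treats its input as a bag of tokens: reordering the words only reorders the per-token outputs in the same way, so any pooled prediction is identical regardless of word order. Positional encodings or embeddings are therefore essential for tasks where order carries meaning.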
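The permutation behavior can be checked numerically. The following is a minimal NumPy sketch, not taken from any of the sources: the `self_attention` helper, the random weights, and the shapes are all illustrative assumptions, and it covers only a single unmasked attention head with no positional signal.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # hypothetical sizes for the demo
x = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings, no positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows the same way.
equivariant = np.allclose(out[perm], out_perm)
# A mean-pooled summary is therefore unchanged by the shuffle.
pooled_same = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
```

Both `equivariant` and `pooled_same` should come out true here. A causal mask breaks this symmetry, since each token then attends only to the tokens before it.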
How It Works
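The original Transformer uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(...), where pos is the token position and i indexes pairs of embedding dimensions. These are added to the token embeddings before the first layer. Learned positional embeddings, with one trainable vector per position, are an alternative.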
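The sinusoidal formula can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name, the `base` parameter, and the even-`d_model` restriction are assumptions made here, and the cosine is taken to use the same angle as the sine for each dimension pair.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2), the values 2i
    angles = positions / base ** (two_i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions, same angle
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)

# Hypothetical token embeddings; the encoding is simply added element-wise.
token_embeddings = np.random.default_rng(0).normal(size=(50, 16))
x = token_embeddings + pe
```

Because the table depends only on `max_len` and `d_model`, it can be computed once and reused, and it has no trainable parameters.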
Modern variants include:
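- RoPE (Rotary Position Embedding): rotates query/key vectors by a position-dependent angle, so attention scores depend on relative position; used in LLaMA, Qwen, Gemma. Running well beyond the training length typically needs additional frequency scaling or fine-tuning.
- ALiBi (Attention with Linear Biases): adds a per-head linear penalty to attention scores that grows with query-key distance, instead of adding positional embeddings to the input.
- NoPE: no explicit positional encoding at all; in decoder-only models the causal attention mask still lets the model infer order from context.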
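The first two variants can be sketched in NumPy as follows. These are simplified illustrations under stated assumptions, not the code of any particular model: `rope` pairs adjacent dimensions (some implementations pair the two halves of the vector instead), and `alibi_bias` uses the geometric slope schedule described for head counts that are powers of two, combined with a causal mask. Both function names are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by a position-dependent angle.

    x: (seq_len, d) float array with even d; positions: (seq_len,) array.
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d / 2,)
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) additive bias for causal attention."""
    # Assumed slope schedule: 2^(-8h / num_heads) for h = 1..num_heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]  # query index
    j = np.arange(seq_len)[None, :]  # key index
    distance = i - j                 # >= 0 for current and past keys
    bias = -slopes[:, None, None] * distance[None, :, :]
    # Future keys are masked out entirely.
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same offset (2) at different absolute positions gives the same RoPE score.
score_near = rope(q, np.array([3])) @ rope(k, np.array([1])).T
score_far = rope(q, np.array([10])) @ rope(k, np.array([8])).T
rope_is_relative = np.allclose(score_near, score_far)

bias = alibi_bias(seq_len=4, num_heads=8)  # added to q @ k.T before the softmax
```

In both cases the position signal enters at the attention score rather than at the input embeddings, which is why neither method needs a positional embedding table.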
Key Sources
- attention-is-all-you-need — introduces sinusoidal positional encodings, the original PE design
- rope-rotary-position-embedding — RoPE: rotates Q/K vectors per-layer so relative position falls out of the dot product
- alibi-train-short-test-long — ALiBi: drops PE entirely; adds a per-head linear distance penalty to attention scores; enables length extrapolation at inference
Related Concepts
Open Questions
- Best approach for long-context extrapolation beyond training length