What It Is
Positional encodings inject information about token order into the Transformer, which is otherwise permutation-equivariant (pure attention has no notion of position).
Why It Matters
Without positional information, a Transformer treats its input as an unordered set of tokens and produces identical outputs for any permutation of the sequence. Positional encodings or embeddings are therefore essential for sequence-understanding tasks.
How It Works
The original Transformer uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). These are added element-wise to the token embeddings before the first layer. Learned positional embeddings (a trainable vector per position) are a common alternative.
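The sinusoidal formula above can be sketched in a few lines of NumPy (a minimal illustration; the function name and shapes are my own, not from a particular library):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# The resulting (seq_len, d_model) matrix is added to the token embeddings.
pe = sinusoidal_pe(seq_len=128, d_model=64)
```

Note that position 0 yields sin(0) = 0 in even dimensions and cos(0) = 1 in odd ones, and each dimension pair oscillates at a different geometric frequency, which is what lets the model attend to relative offsets.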
Modern variants include:
- RoPE (Rotary Position Embedding) — rotates query/key vectors by a position-dependent angle so attention scores depend on relative position; used in LLaMA, Qwen, and Gemma, and a common basis for long-context extension methods.
- ALiBi — adds no embeddings at all; instead it applies a linear bias to attention scores proportional to the query–key distance, which helps extrapolation beyond the training length.
- NoPE — no explicit positional encoding; in decoder-only models the causal attention mask breaks permutation symmetry, letting the model infer order implicitly.
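The rotation idea behind RoPE can be illustrated on a single vector. This is a simplified sketch (the pairing convention and base 10000 follow the original RoPE formulation, but real implementations apply this to batched query/key tensors inside attention):

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    # Rotate each 2D pair (x[2i], x[2i+1]) by angle pos * theta_i,
    # where theta_i = 10000**(-2i / d). Relative position falls out of
    # the dot product: rope(q, m) . rope(k, n) depends only on m - n.
    d = x.shape[-1]
    theta = np.power(10000.0, -2.0 * np.arange(d // 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated by an angle linear in position, shifting query and key by the same offset leaves their dot product unchanged, which is the relative-position property that motivates RoPE.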
Key Sources
Related Concepts
Open Questions
- Best approach for long-context extrapolation beyond training length