What It Is

Positional encodings inject information about token order into the Transformer, which is otherwise permutation-equivariant (pure attention has no notion of position).

Why It Matters

Without positional information, a Transformer treats its input as a bag of tokens: permuting the input merely permutes the per-token outputs, so word order cannot influence what the model computes. Positional encodings or embeddings are therefore essential for any task that depends on sequence order.

How It Works

The original Transformer uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). These are added to the token embeddings before the first layer. Learned positional embeddings, trained jointly with the model, are a common alternative.
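
The sinusoidal formula can be sketched in a few lines of NumPy; `sinusoidal_encoding` is a hypothetical helper name, not from any particular library:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dims: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even indices get sine
    pe[:, 1::2] = np.cos(angles)                    # odd indices get cosine
    return pe                                       # added to token embeddings
```

Each dimension pair oscillates at a different geometric frequency, so nearby positions get similar vectors while distant positions stay distinguishable.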

Modern variants include:

  • RoPE (Rotary Position Embedding) — rotates query/key vectors by position-dependent angles so attention scores depend only on relative offsets; used in LLaMA, Qwen, and Gemma, and extended to longer contexts via interpolation schemes such as NTK scaling and YaRN.
  • ALiBi — adds a negative, per-head linear bias to attention scores proportional to query–key distance; no position embeddings are added, and the bias extrapolates to sequences longer than those seen in training.
  • NoPE — no explicit positional encoding; in decoder-only models the causal mask already breaks permutation symmetry, so the model can infer order from context.
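
Of these, RoPE is the most widely deployed. A minimal NumPy sketch of the rotation (the name `rope_rotate` is illustrative, not a library API):

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    # Rotate each pair (x[2i], x[2i+1]) by angle pos * theta_i,
    # where theta_i = base^(-2i/d). Applied to queries and keys
    # before the attention dot product.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property: the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n, which is what makes the scheme relative rather than absolute.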

Key Sources

Open Questions

  • Best approach for long-context extrapolation beyond training length