Positional Encoding

What It Is

Positional encodings inject information about token order into the Transformer, which is otherwise permutation-equivariant (pure attention has no notion of position).

Why It Matters

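Without positional information, self-attention treats its input as an unordered set: permuting the input tokens merely permutes the outputs, so a bidirectional Transformer cannot tell "dog bites man" from "man bites dog". Some positional signal, whether an encoding, an embedding, or an attention bias, is therefore essential for sequence understanding tasks.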
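The order-blindness of attention can be checked numerically. The following is a minimal sketch, assuming a single unmasked attention head with random weights; the helper name self_attention and all shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no mask and no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs in the same way ...
equivariant = np.allclose(out[perm], out_perm)
# ... so any order-independent summary (here, mean pooling) is unchanged.
pooled_equal = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
print(equivariant, pooled_equal)
```

Both checks should hold for any permutation, which is why an unmasked model with no positional signal cannot use word order.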

How It Works

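The original Transformer uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(...), with the same argument as the sine. Here pos is the token position and i indexes pairs of embedding dimensions. The encodings are added to the token embeddings before the first layer. Learned positional embeddings, with one trainable vector per position, are an alternative.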
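The sinusoidal formula can be sketched in a few lines of NumPy. This is an illustrative implementation, assuming an even d_model; the function name sinusoidal_encoding and the zero-valued placeholder embeddings are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), the values 2i
    angles = pos / base ** (two_i / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=16)

# Placeholder token embeddings (all zeros), standing in for a real embedding lookup.
token_embeddings = np.zeros((50, 16))
x = token_embeddings + pe  # the encodings are added, not concatenated
```

Because the table depends only on position and dimension, it can be precomputed once and sliced to the length of each input sequence.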

Modern variants include:

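RoPE (Rotary Position Embedding): rotates query/key vectors by position-dependent angles so that attention scores depend on relative position; used in LLaMA, Qwen, Gemma. Extending it beyond the training length usually needs additional scaling techniques.
ALiBi: adds a per-head linear bias to attention scores based on relative distance.
NoPE: no explicit positional encoding at all; in decoder-only models the causal attention mask lets the model infer order implicitly.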
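The first two variants can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: rope pairs adjacent feature dimensions (some implementations pair the two halves of the vector instead), and alibi_bias uses the geometric slope schedule 2^(-8h/n), which assumes a power-of-two head count and causal attention. Both function names are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate adjacent feature pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d / 2,) one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin        # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def alibi_bias(n_heads, seq_len):
    """Causal ALiBi bias of shape (n_heads, seq_len, seq_len), added to attention scores."""
    # Assumed slope schedule for a power-of-two head count: 2^(-8h/n), h = 1..n.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]  # i - j
    bias = -slopes[:, None, None] * distance[None, :, :]
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)  # mask future keys

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# RoPE: the score depends only on the offset between query and key positions.
score_a = rope(q, np.array([3])) @ rope(k, np.array([1])).T
score_b = rope(q, np.array([10])) @ rope(k, np.array([8])).T
same_offset_same_score = np.allclose(score_a, score_b)

# ALiBi: the penalty grows linearly with distance, at a different rate per head.
bias = alibi_bias(n_heads=8, seq_len=4)
```

RoPE is applied to queries and keys inside every attention layer, whereas the ALiBi bias is added to the pre-softmax scores and needs no change to the embeddings.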

Key Sources

attention-is-all-you-need — introduces sinusoidal positional encodings, the original PE design
rope-rotary-position-embedding — RoPE: rotates Q/K vectors per-layer so relative position falls out of the dot product
alibi-train-short-test-long — ALiBi: drops PE entirely; adds a per-head linear distance penalty to attention scores; enables length extrapolation at inference

Open Questions

Best approach for long-context extrapolation beyond training length
