What It Is

Sliding Window Attention (SWA) restricts each token’s attention to a fixed window of the $W$ most recent tokens instead of the full sequence. Rather than attending to all previous tokens (standard attention, $O(n^2)$ total cost), each token attends only to the $W$ most recent positions ($O(W)$ cost per token, constant regardless of sequence length).
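
A quick way to see the cost difference is to count how many keys each query position must score against. The window size and positions below are illustrative example values, not figures from the article.

```python
# Illustrative comparison: keys attended per token under full causal attention vs. SWA.
# W and the positions are example values only.
W = 4096
for i in (1_000, 32_000, 128_000):
    full_keys = i + 1          # causal attention: token i scores all earlier tokens plus itself
    swa_keys = min(i + 1, W)   # sliding window: capped at W regardless of position
    print(f"position {i:>7}: full={full_keys:>7} keys, sliding window={swa_keys} keys")
```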

Why It Matters

Standard attention is quadratic in sequence length. A 32K-token sequence requires 32× more KV cache than a 1K sequence and proportionally more compute per inference step. SWA caps both: the KV cache stays bounded at $W$ entries per layer, and per-step attention cost is $O(W)$ regardless of how long the sequence grows. Information beyond the window isn’t lost; it percolates forward through stacked layers.

How It Works

For position $i$ with window size $W$, attention is restricted to the window of positions $j$ with $i - W < j \le i$:

$$\mathrm{Attention}(q_i) = \mathrm{softmax}\!\left(\frac{q_i\, K_{i-W+1:i}^{\top}}{\sqrt{d_k}}\right) V_{i-W+1:i}$$
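
A minimal single-head sketch of this computation, using a mask-based formulation in NumPy. The function name, shapes, and toy sizes are assumptions for illustration, not any particular library’s API.

```python
import numpy as np

def sliding_window_attention(Q, K, V, W):
    """Q, K, V: (n, d_k) arrays for one head; W: window size."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) raw attention scores

    # Each query i may only attend to keys j with i - W < j <= i (causal + windowed).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = (j <= i) & (j > i - W)
    scores = np.where(allowed, scores, -np.inf)

    # Softmax over the allowed positions only.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8 tokens, window of 3 -> each row of the weight matrix has at most 3 nonzeros.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 4))
out = sliding_window_attention(Q, K, V, W=3)
print(out.shape)  # (8, 4)
```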

Layer-wise propagation: Each layer can see $W$ tokens back. After $k$ stacked layers, a token can incorporate information from up to $k \cdot W$ positions back, because Layer 2 attends to Layer 1’s representations, which already encode tokens from up to $W$ positions earlier. With $k$ layers and window size $W$, the effective context span is $k \cdot W$ tokens.
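
As a concrete illustration of the arithmetic; the window size and layer count below are example values, not figures from the article.

```python
# Example values only: a 4096-token window across 32 layers.
W, n_layers = 4096, 32
effective_span = W * n_layers   # each stacked layer extends the reachable context by another W tokens
print(effective_span)           # 131072, i.e. ~131K tokens of theoretical context
```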

Rolling buffer cache: With a fixed window, the KV cache only needs $W$ slots. Keys and values for position $i$ are stored at slot $i \bmod W$, overwriting old entries when $i \ge W$. For a 32K-token sequence with $W = 4096$, this reduces cache size by $8\times$.
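
A minimal sketch of such a rolling buffer in NumPy; the class name, shapes, and methods are illustrative assumptions, not a specific framework’s cache implementation.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: position i always lives at slot i mod W."""

    def __init__(self, window: int, d_model: int):
        self.window = window
        self.keys = np.zeros((window, d_model))
        self.values = np.zeros((window, d_model))
        self.length = 0                          # total tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.length % self.window         # position i maps to slot i mod W
        self.keys[slot] = k                      # once i >= W, this overwrites the entry for i - W
        self.values[slot] = v
        self.length += 1

    def current(self):
        """Cached keys/values in chronological order (oldest retained token first)."""
        if self.length <= self.window:
            idx = list(range(self.length))
        else:
            start = self.length % self.window    # slot holding the oldest retained position
            idx = [(start + j) % self.window for j in range(self.window)]
        return self.keys[idx], self.values[idx]

# Toy usage: window of 4, feed 6 tokens; only the last 4 keys/values remain cached.
cache = RollingKVCache(window=4, d_model=2)
for i in range(6):
    cache.append(np.full(2, i), np.full(2, i))
print(cache.current()[0][:, 0])  # [2. 3. 4. 5.]
```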

The key assumption: most attention weights are concentrated on nearby tokens in practice. Long-range attention from tokens thousands of positions back tends to carry low weight. SWA formalizes this empirical observation by hard-constraining the attention window.

Key Sources