What It Is
Sliding Window Attention (SWA) restricts each token’s attention to a fixed window of the W most recent tokens instead of the full sequence. Rather than attending to all previous tokens (standard attention, O(n²) cost over a length-n sequence), each token attends only to the W most recent positions (O(W) cost per token, constant regardless of sequence length).
Why It Matters
Standard attention is quadratic in sequence length. A 32K-token sequence requires 32× more KV cache than a 1K-token sequence and proportionally more compute per inference step. SWA caps both: the KV cache stays bounded at W entries per layer, and attention cost is O(W) per token regardless of how long the sequence grows. Information beyond the window isn’t lost — it percolates forward through stacked layers.
How It Works
For position i, attention is restricted to the window [i−W+1, i]:

  score(i, j) = q_i·k_j / √d   if i−W < j ≤ i,   −∞ otherwise
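The windowed mask can be sketched in NumPy; this is a minimal single-head reference implementation (the toy shapes and W=3 are illustrative, not a production kernel):

```python
import numpy as np

def sliding_window_attention(Q, K, V, W):
    """Single-head attention where position i attends only to
    positions j with i - W < j <= i (causal and windowed)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # (n, n) raw scores
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = (j <= i) & (j > i - W)             # True inside the window
    scores = np.where(mask, scores, -np.inf)  # block out-of-window positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Each row of `weights` has nonzero mass only inside its window.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out, w = sliding_window_attention(Q, K, V, W=3)
```

Masking with −∞ before the softmax is the standard trick: exp(−∞) = 0, so out-of-window positions receive exactly zero weight while each row still normalizes to 1.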
Layer-wise propagation: Each layer can see W tokens back. After k stacked layers, token i can incorporate information from up to k·W positions back — because Layer 2 attends to Layer 1’s representations, which already encoded tokens W positions prior. With L = 32 layers and W = 4096 (the Mistral 7B configuration), the effective context span is about 131K tokens.
Rolling buffer cache: With a fixed window, the KV cache only needs W slots. Keys and values for position i are stored at slot i mod W, overwriting old entries once i ≥ W. For a 32K-token sequence with W = 4096, this reduces cache size by 8×.
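The rolling-buffer indexing can be sketched as follows (the class name and toy sizes are illustrative assumptions):

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: position i lives at slot i % W."""
    def __init__(self, W: int, d: int):
        self.W = W
        self.keys = np.zeros((W, d))
        self.values = np.zeros((W, d))

    def store(self, pos: int, k, v):
        slot = pos % self.W          # wrap around; overwrites once pos >= W
        self.keys[slot] = k
        self.values[slot] = v

cache = RollingKVCache(W=4, d=2)
for pos in range(6):                 # positions 4 and 5 overwrite slots 0 and 1
    cache.store(pos, np.full(2, pos), np.full(2, pos))
print(cache.keys[:, 0])              # slots now hold positions [4, 5, 2, 3]
```

Because a token never needs keys or values older than W positions, the overwritten entries are exactly the ones outside every remaining token’s window.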
The key assumption: most attention weights are concentrated on nearby tokens in practice. Long-range attention from tokens thousands of positions back tends to carry low weight. SWA formalizes this empirical observation by hard-constraining the attention window.
Key Sources
- mistral-7b — uses SWA with W=4096 and a rolling buffer cache
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding