What It Is
Sliding Window Attention (SWA) restricts each token’s attention to a fixed window of the W most recent tokens instead of the full sequence. Rather than attending to all previous tokens (standard attention, O(n²) cost over a length-n sequence), each token attends only to the W most recent positions (O(W) cost per token, constant regardless of sequence length).
Why It Matters
Standard attention is quadratic in sequence length. A 32K-token sequence requires 32× more KV cache than a 1K-token sequence and proportionally more compute per inference step. SWA caps both: the KV cache stays bounded at W entries per layer, and attention cost is O(W) per token regardless of how long the sequence grows. Information beyond the window isn’t lost — it percolates forward through stacked layers.
How It Works
For position i, attention is restricted to the window [i−W+1, i]:

  score(i, j) = q_i·k_j / √d   if i−W < j ≤ i,   −∞ otherwise
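The windowed mask can be sketched in NumPy; this is a minimal single-head reference implementation (the toy shapes and W=3 are illustrative, not a production kernel):

```python
import numpy as np

def sliding_window_attention(Q, K, V, W):
    """Single-head attention where position i attends only to
    positions j with i - W < j <= i (causal and windowed)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # (n, n) raw scores
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = (j <= i) & (j > i - W)             # True inside the window
    scores = np.where(mask, scores, -np.inf)  # block out-of-window positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Each row of `weights` has nonzero mass only inside its window.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out, w = sliding_window_attention(Q, K, V, W=3)
```

Masking with −∞ before the softmax is the standard trick: exp(−∞) = 0, so out-of-window positions receive exactly zero weight while each row still normalizes to 1.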
Layer-wise propagation: Each layer can see W tokens back. After k stacked layers, token i can incorporate information from up to k·W positions back — because Layer 2 attends to Layer 1’s representations, which already encoded tokens W positions prior. With L = 32 layers and W = 4096 (the Mistral 7B configuration), the effective context span is about 131K tokens.
Rolling buffer cache: With a fixed window, the KV cache only needs W slots. Keys and values for position i are stored at slot i mod W, overwriting old entries once i ≥ W. For a 32K-token sequence with W = 4096, this reduces cache size by 8×.
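The rolling-buffer indexing can be sketched as follows (the class name and toy sizes are illustrative assumptions):

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: position i lives at slot i % W."""
    def __init__(self, W: int, d: int):
        self.W = W
        self.keys = np.zeros((W, d))
        self.values = np.zeros((W, d))

    def store(self, pos: int, k, v):
        slot = pos % self.W          # wrap around; overwrites once pos >= W
        self.keys[slot] = k
        self.values[slot] = v

cache = RollingKVCache(W=4, d=2)
for pos in range(6):                 # positions 4 and 5 overwrite slots 0 and 1
    cache.store(pos, np.full(2, pos), np.full(2, pos))
print(cache.keys[:, 0])              # slots now hold positions [4, 5, 2, 3]
```

Because a token never needs keys or values older than W positions, the overwritten entries are exactly the ones outside every remaining token’s window.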
The key assumption: most attention weights are concentrated on nearby tokens in practice. Long-range attention from tokens thousands of positions back tends to carry low weight. SWA formalizes this empirical observation by hard-constraining the attention window.
Key Sources
- mistral-7b — uses SWA with W=4096 and a rolling buffer cache
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding