You’re running an LLM server. A user asks for a 500-word answer, so your system reserves GPU memory for 2,000 tokens “just in case.” The actual response is 400 tokens. That reserved-but-unused space? Gone. Another request comes in asking for a short answer — but the GPU looks full. You can only run one or two requests at a time when you should be handling twenty. The math on your A100 says you have 80 GB. Your throughput suggests you have 16 GB. The rest is wasted by pre-reservation and fragmentation.
This paper identified that problem and fixed it by applying a solution operating systems invented in the 1960s.
The analogy
Think about how your computer runs multiple programs simultaneously even though RAM is finite.
Each program thinks it has its own giant block of contiguous memory. But the OS is lying to it — it’s using a page table to map the program’s logical addresses to scattered physical pages wherever RAM happens to be free. The program asks for byte 0 through byte 99,999 and gets a contiguous-looking address space. Under the hood, those bytes live in 25 different physical locations. The program doesn’t know and doesn’t care.
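That logical-to-physical mapping is worth seeing concretely. Here's a toy page table in Python — the page and frame numbers are made up for illustration, not real OS values:

```python
# Toy page table: the program sees a contiguous logical address space,
# but the "OS" maps each logical page to whatever physical frame was free.

PAGE_SIZE = 4096

# Logical page -> physical frame. Pages 0..2 landed in scattered frames.
page_table = {0: 7, 1: 2, 2: 9}

def translate(logical_addr):
    """Translate a logical address to a physical one via the page table."""
    page, offset = divmod(logical_addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

# Logical bytes 0..12287 look contiguous to the program...
print(translate(0))     # 28672  (frame 7)
print(translate(4096))  # 8192   (frame 2)
print(translate(8200))  # 36872  (frame 9, offset 8)
```

The program only ever sees logical addresses; the scattered frames are invisible to it.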
Before vLLM, LLM servers worked like a computer without virtual memory — every program had to claim its contiguous chunk upfront. The whole chunk was reserved even if only 10% was actually used.
vLLM’s insight: give the KV cache its own page table. Allocate memory one small block at a time as tokens are actually generated. No pre-reservation, no wasted space.
The mechanism
First, you need to understand what the KV cache actually is.
Every transformer layer computes Key and Value vectors for each token it processes. When generating autoregressively (one token at a time), you don’t recompute old tokens — you cache their K and V. So generating a 1000-token response requires storing K and V for all 1000 tokens in GPU RAM, growing one slot at a time as each new token is produced.
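A minimal sketch of that growth pattern (toy shapes, with trivial stand-ins for the real K/V projections):

```python
# Autoregressive decoding with a KV cache: each step appends one K and one V
# vector per layer instead of recomputing them for every previous token.
import numpy as np

n_layers, d = 2, 4                              # toy model: 2 layers, head dim 4
kv_cache = [([], []) for _ in range(n_layers)]  # per layer: (K list, V list)

def decode_step(token_embedding):
    """Append this token's K and V to every layer's cache."""
    for k_list, v_list in kv_cache:
        # Stand-ins for the layer's real K/V projections.
        k_list.append(token_embedding * 0.5)
        v_list.append(token_embedding * 2.0)

for t in range(5):                              # generate 5 tokens
    decode_step(np.ones(d) * t)

# The cache grows exactly one slot per generated token, per layer.
print(len(kv_cache[0][0]))   # 5
```

The point is the shape of the workload: the cache grows one slot at a time, old entries are never rewritten, and you don't know in advance how long it will get.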
For LLaMA-13B: each token’s KV cache costs roughly 0.8 MB across all layers, so a full 2048-token conversation holds about 1.6 GB — just for the cache of one request.
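A back-of-envelope check on the per-token cost, assuming the 13B-class config (40 layers, hidden size 5120, fp16 weights for the cache):

```python
# Per-token KV cache cost for a 13B-class model (assumed config values).
n_layers   = 40
hidden     = 5120     # = n_heads * head_dim
bytes_fp16 = 2

# One K vector and one V vector per layer, per token.
kv_per_token = 2 * n_layers * hidden * bytes_fp16
print(kv_per_token)                        # 819200 bytes ~= 0.8 MB

seq_len = 2048
print(kv_per_token * seq_len / 2**30)      # 1.5625 -> ~1.6 GiB per full sequence
```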
The old way:
Before vLLM, systems reserved memory at request start time based on maximum possible length:
Request A: "Tell me about black holes" → max_tokens=2048
→ Reserve 2048 slots × ~0.8 MB = 1.6 GB (even if answer is 300 tokens)
Request B: "Summarize this article" → max_tokens=4096
→ Reserve 4096 slots × ~0.8 MB = 3.3 GB
GPU RAM: ████████████░░░░░░░░░░░░░░
used wasted (fragmented/reserved)
40% 60% gone — can't fit Request C
Two types of waste:
- Internal fragmentation: reserved tokens that never get generated
- External fragmentation: gaps between allocations too small to use
The paper found 60–80% of GPU memory was being wasted this way.
The PagedAttention fix:
PHYSICAL GPU MEMORY (divided into 16-token blocks)
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ Blk0 │ Blk1 │ Blk2 │ Blk3 │ Blk4 │ Blk5 │ Blk6 │ Blk7 │
│ free │ (A)  │ free │ (B)  │ free │ (A)  │ free │ (B)  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
Request A's block table:
Logical block 0 → Physical block 1
Logical block 1 → Physical block 5 ← non-contiguous! that's fine
(next block allocated when 16 more tokens are generated)
Request B's block table:
Logical block 0 → Physical block 3
Logical block 1 → Physical block 7
No pre-allocation. No reserved space. A physical block is handed out only when 16 tokens have actually been generated and need to be stored.
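That allocation policy fits in a few lines. Here's a minimal sketch of on-demand block allocation — the class and method names are mine, not vLLM's actual API:

```python
# vLLM-style on-demand block allocation (illustrative, not vLLM's real code).

BLOCK_SIZE = 16

class BlockAllocator:
    """Hands out free physical block ids in order."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def alloc(self):
        return self.free.pop(0)

class Sequence:
    """Tracks one request's logical-to-physical block table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []      # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grab a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(16):
    a.append_token()       # A fills physical block 0
b.append_token()           # B starts and grabs physical block 1
for _ in range(4):
    a.append_token()       # A's next logical block lands in physical block 2

print(a.block_table)       # [0, 2] -- non-contiguous, and that's fine
print(b.block_table)       # [1]
```

Because requests interleave, one sequence's blocks end up scattered — which is exactly why the attention kernel needs to follow the block table.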
PagedAttention is the custom attention kernel that knows how to compute attention by following the block table — fetching K and V from non-contiguous physical locations and assembling the correct attention scores. It’s attention that understands paging.
The math that matters
Standard attention:
O = softmax(QK^T / √d) · V
where Q = what I’m looking for, K = what each token offers up for comparison, V = what each token actually says, and √d = scaling that keeps the dot products from saturating the softmax.
Nothing about this formula requires K and V to be stored contiguously. The formula just multiplies — it doesn’t care where the numbers live in RAM. The only thing that required contiguous storage was the engineering assumption nobody questioned.
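You can verify that claim directly in NumPy: scatter the K/V rows into a larger "physical" buffer in arbitrary block order, gather them back through a block table, and the attention output is bit-for-bit the same (a toy demo, not the real CUDA kernel):

```python
# Attention doesn't care where K and V physically live: gathering rows
# through a block table gives the same output as contiguous storage.
import numpy as np

rng = np.random.default_rng(0)
d, n, B = 4, 8, 4                     # head dim, tokens, tokens per block
Q = rng.normal(size=(1, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v

# Scatter logically ordered K/V rows into a larger "physical" buffer,
# block by block, at arbitrary physical block ids.
block_table = [4, 1]                  # logical block -> physical block
phys_K = np.zeros((6 * B, d))
phys_V = np.zeros((6 * B, d))
for lb, pb in enumerate(block_table):
    phys_K[pb*B:(pb+1)*B] = K[lb*B:(lb+1)*B]
    phys_V[pb*B:(pb+1)*B] = V[lb*B:(lb+1)*B]

# Gather rows back in logical order via the block table; same math, same answer.
idx = np.concatenate([np.arange(pb*B, (pb+1)*B) for pb in block_table])
print(np.allclose(attention(Q, K, V), attention(Q, phys_K[idx], phys_V[idx])))  # True
```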
Memory waste with paging:
Waste ≤ (block_size - 1) tokens per sequence
With 16-token blocks, you waste at most 15 token slots at the end of each sequence (the partially filled last block). In practice that works out to under 4% overhead — compared to 60–80% with the old approach.
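The bound is easy to check: round each sequence length up to whole blocks and see what's left over.

```python
# Waste per sequence is bounded by the partially filled last block:
# at most block_size - 1 token slots, whatever the sequence length.
block_size = 16
for seq_len in (500, 1000, 2049):
    allocated = -(-seq_len // block_size) * block_size   # round up to whole blocks
    print(seq_len, allocated - seq_len)                  # waste: 12, 8, 15 slots
```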
Walkthrough with actual numbers
Scenario: parallel sampling — generating 3 different responses to the same prompt.
Old system:
Prompt: "What is the capital of France?" (let's say 8 tokens of KV cache)
System reserves for 3 separate completions:
Response 1 buffer: [8 prompt tokens][up to 200 response tokens] = 208 slots
Response 2 buffer: [8 prompt tokens][up to 200 response tokens] = 208 slots
Response 3 buffer: [8 prompt tokens][up to 200 response tokens] = 208 slots
Total: 624 slots reserved. Prompt KV cache duplicated 3 times.
At ~0.8 MB per token (LLaMA-13B), those 624 reserved slots come to roughly 0.5 GB — most of it never used, and the prompt's KV cache stored three times over.
vLLM with copy-on-write:
Physical blocks:
[Prompt block 0] ← ref_count = 3 (shared by all 3 responses)
Block tables:
Response 1: [→ shared prompt blk] [→ unique response blk 1]
Response 2: [→ shared prompt blk] [→ unique response blk 2]
Response 3: [→ shared prompt blk] [→ unique response blk 3]
Total: 1 prompt copy + 3 response blocks.
Prompt memory: 1/3 what it was. Copy-on-write fires only when responses diverge.
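Here's a simplified ref-counted sketch of that sharing — one shared prompt block for three sampled responses, with names of my own invention:

```python
# Ref-counted block sharing with copy-on-write (simplified, illustrative).

class PhysicalBlock:
    def __init__(self, bid):
        self.bid = bid
        self.ref_count = 0

_free = iter(PhysicalBlock(i) for i in range(8))

def alloc():
    block = next(_free)
    block.ref_count = 1
    return block

prompt_block = alloc()
prompt_block.ref_count = 3                 # shared by all three responses
tables = {f"resp{i}": [prompt_block] for i in (1, 2, 3)}

def write(seq, logical_idx):
    """Copy-on-write: give the writer a private copy if the block is shared."""
    block = tables[seq][logical_idx]
    if block.ref_count > 1:
        block.ref_count -= 1
        tables[seq][logical_idx] = alloc()  # private block for this sequence

write("resp1", 0)                           # resp1 diverges from the others
print(prompt_block.ref_count)               # 2: still shared by resp2 and resp3
print(tables["resp1"][0].bid)               # 1: resp1 now has its own block
```

Nothing is copied until a sequence actually writes into a shared block — which for a shared prompt prefix may be never.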
The paper reports this reduces memory overhead for parallel sampling by up to 55%, which translates to up to 2.2× improvement in throughput for beam search and parallel sampling workloads.
What’s clever — find the instinct
The real insight isn’t that paging is useful — OS engineers knew that in the 1960s. The insight is recognizing that KV cache and process virtual memory have the same four properties:
- They grow incrementally (one token / one page at a time)
- They’re read-heavy once written (old tokens never change)
- Multiple users often share prefixes (system prompts / shared code libraries)
- Final size is unknown at allocation time
Once you see this mapping, the solution falls out of 60 years of OS research. Copy-on-write for beam search comes for free once the paging infrastructure is in place.
The instinct: “We’re doing pre-OS-era memory management on GPUs. The OS solved this problem decades ago. Why are we re-inventing worse wheels?”
Why didn’t someone do this earlier? Because early transformer sequences were short (BERT: 512 tokens). At 512 tokens, pre-allocating a fixed buffer was fine — waste was small. When models scaled to 4K, 32K, 128K context, the waste compounded but the engineering pattern didn’t change. Path dependency: nobody questioned the assumption once sequences got long.
Real quotes from the paper
“The KV cache memory for each request is huge and grows and shrinks dynamically… existing systems pre-allocate a contiguous chunk of memory based on the request’s maximum sequence length.”
Translation: The assumption was so baked in that nobody named it. Every existing system just did it. The contribution here is partly noticing the assumption and asking if it was necessary.
“We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.”
Translation: Most of your expensive GPU is sitting idle holding space for tokens that will never be generated. This is not a small optimization — it’s the primary bottleneck.
“PagedAttention allows storing continuous keys and values in non-contiguous memory space.”
Translation: The key-value pairs from attention still need to be logically sequential (you need them in order to compute attention), but they don’t need to be physically adjacent in RAM. The block table provides the mapping. This distinction — logical continuity vs. physical contiguity — is the whole trick.
“Memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with a mere waste of under 4%.”
Translation: Instead of 60–80% waste, you get under 4%. The improvement isn’t incremental — it’s a qualitative change in how many requests you can serve simultaneously.
Does it actually work?
| System | GPU | Throughput (relative) | vs. HuggingFace Transformers |
|---|---|---|---|
| HuggingFace Transformers | A10G | baseline | 1× |
| HuggingFace TGI (prev SOTA) | A10G | ~3–4× HF | ~3.5× |
| vLLM (PagedAttention) | A10G | up to 24× HF | 24× |
The 24× number is for LLaMA-7B with parallel sampling. For single-output generation, the gain is still 2–4× versus previous state-of-the-art systems like FasterTransformer and Orca. Same hardware, same per-request latency — dramatically more concurrent requests served.
In production at LMSYS Chatbot Arena: vLLM cut the number of GPUs needed to serve the same traffic by 50% while handling 30K–60K requests per day.
What doesn’t work:
Block size is a real tuning knob with tradeoffs. Too small (1 token per block): page table overhead dominates. Too large (512 tokens): internal fragmentation creeps back. The paper uses 16 tokens as default — empirical, not principled.
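A rough sweep makes the tradeoff visible — smaller blocks shrink worst-case fragmentation but blow up the number of block-table entries to manage (toy numbers):

```python
# The block-size knob: mapping overhead vs. internal fragmentation.
seq_len = 1000
for block_size in (1, 16, 256):
    entries = -(-seq_len // block_size)    # block-table entries (ceil division)
    worst_waste = block_size - 1           # slots in a partially filled last block
    print(block_size, entries, worst_waste)
```

At block size 1 you'd track a thousand mappings per sequence; at 256 you're back to reserving hundreds of slots that may never fill. 16 sits in the middle.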
The attention kernel requires custom CUDA code that jumps across non-contiguous memory. Porting to TPUs or custom accelerators isn’t free — the original vLLM was CUDA-only.
Prefill (processing the input prompt before generation starts) still happens on contiguous memory in the original design. Only the decoding phase benefits from paging. Chunked prefill to address this is follow-on work.
So what?
If you’re deploying LLMs in production, this is the paper that changed the economics. vLLM is now the de facto standard for LLM serving — Hugging Face TGI, Ollama, and most inference providers use PagedAttention under the hood. If you’re not using it, you’re leaving 60–80% of your GPU memory effectiveness on the table.
The deeper lesson connects to FlashAttention, which made a similar move: instead of trying to approximate the math to do less computation, it asked “are we using the hardware correctly?” FlashAttention found that memory bandwidth (not compute) was the bottleneck for training. vLLM found that memory management (not compute) was the bottleneck for serving. Both papers beat state-of-the-art methods not by being cleverer about the math, but by being more honest about the hardware.
LLM serving was bottlenecked not by attention math but by a 1960s-era memory management mistake, and a 1960s-era solution fixed it.
Connections
- kv-cache — the resource being managed
- speculative-decoding — complementary inference optimization
- flash-attention — shares the hardware-awareness insight
- attention — the operation whose cache is being managed
Citation
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. https://arxiv.org/abs/2309.06180