Inference efficiency is a stack of mostly independent optimizations, each addressing a different bottleneck. This path works from the most fundamental (eliminating redundant computation) up to system-level memory management and numerical approximations.


Step 1 — KV Cache

kv-cache

Start here. The KV cache is the baseline optimization that makes autoregressive generation tractable at all. Without it, generating each new token requires re-running the forward pass over the entire prefix, so producing N tokens costs O(N²) token computations in total. The KV cache stores the K and V tensors computed at earlier steps, so each decode step processes only the single new token (attention still reads the O(N) cached entries, but nothing is recomputed). Every subsequent optimization in this path either reduces cache memory or improves cache utilization.
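
As a rough illustration (a single-head numpy toy, not a real transformer layer; `attend`, `Wq`, `Wk`, `Wv` are invented names), caching K and V means each step projects only the newly generated token:

```python
import numpy as np

def attend(q, K, V):
    # Single-query scaled dot-product attention.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))  # hidden states, one per decode step

K_cache, V_cache = [], []
for x in tokens:
    # Project only the NEW token; cached K/V from earlier steps are reused.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.array(K_cache), np.array(V_cache))
# Without the cache, every step would redo Wk @ x and Wv @ x
# for the entire prefix before attending.
```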


Step 2 — Flash Attention

flash-attention

The KV cache solves the compute redundancy problem. Flash Attention solves the memory bandwidth problem. Standard attention materializes the full N×N attention matrix in HBM (slow GPU memory). Flash Attention tiles the computation so each block of queries and keys fits in SRAM (fast on-chip memory) and computes the softmax incrementally, so the full matrix is never written to HBM, drastically reducing memory traffic. The two optimizations are complementary: the KV cache removes recomputation in the decode phase, while Flash Attention matters most in the attention-heavy prefill phase.
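
The core trick is the online (streaming) softmax. A minimal single-query numpy sketch (function name and block size are illustrative): keys are processed in blocks, with a running max and normalizer, so no full row of scores is ever materialized. Real Flash Attention does this per tile in fused CUDA kernels, keeping each tile in SRAM:

```python
import numpy as np

def flash_attention_row(q, K, V, block=4):
    # Streaming softmax-attention for one query vector: visit K/V in
    # blocks, maintaining a running max (m), normalizer (l), and
    # weighted-value accumulator (acc). Equivalent to full softmax(QK^T)V.
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    acc = np.zeros(d)
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)        # rescale previous partial results
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s:s + block]
        m = m_new
    return acc / l
```

The rescaling step is what lets the algorithm correct earlier blocks when a later block contains a larger score, which is why the tiled result matches the monolithic softmax exactly.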


Step 3 — Grouped Query Attention (GQA)

gqa-grouped-query-attention

The KV cache grows with sequence length and with the number of KV heads. GQA reduces the number of KV heads while keeping the full number of query heads — LLaMA-3 70B uses 8 KV heads instead of 64, reducing cache size 8x at a modest quality cost. This is an architectural choice baked in at training time, but it’s motivated entirely by inference-time memory constraints. Understand this before PagedAttention.
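A toy numpy sketch of the head mapping (shapes are illustrative, not LLaMA-3's): each group of query heads attends over the same shared KV head, so the cache holds n_kv_heads rather than n_q_heads entries per token:

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 32
group = n_q_heads // n_kv_heads  # here, 4 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, seq, d_head))
# The KV cache stores only n_kv_heads heads: a 4x reduction in this toy.
K = rng.standard_normal((n_kv_heads, seq, d_head))
V = rng.standard_normal((n_kv_heads, seq, d_head))

outs = []
for h in range(n_q_heads):
    kv = h // group  # map query head -> its shared KV head
    scores = Q[h] @ K[kv].T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    outs.append(w @ V[kv])
```

With MHA the cache would hold 8 K and V heads; with MQA, 1; GQA sits in between, trading cache size against how much the query heads must share.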


Step 4 — Speculative Decoding

speculative-decoding

So far, every token still requires one full target-model forward pass. Speculative decoding breaks this: a small draft model proposes K tokens, the large target model verifies all K in a single parallel pass. On typical text, 2–4 tokens are accepted per target call. Throughput improves without changing output quality — the rejection sampling scheme is provably exact, not an approximation.
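The acceptance rule can be sketched for a single drafted position (a toy following the modified rejection sampling of the speculative decoding papers; function name is invented, and a real system verifies all K drafted positions in one batched target-model pass):

```python
import numpy as np

def speculative_step(draft_probs, target_probs, rng):
    # One drafted token: accept with prob min(1, p(x)/q(x)); on rejection,
    # resample from the normalized residual max(0, p - q). The returned
    # token is an exact sample from target_probs.
    x = rng.choice(len(draft_probs), p=draft_probs)  # draft model proposes x
    if rng.random() < min(1.0, target_probs[x] / draft_probs[x]):
        return x                                     # target accepts the draft
    residual = np.maximum(target_probs - draft_probs, 0)
    return rng.choice(len(residual), p=residual / residual.sum())
```

The residual resampling is what makes the scheme exact rather than approximate: averaged over the draft's proposals, the accepted-or-resampled token is distributed exactly according to the target model.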


Step 5 — PagedAttention / vLLM

pagedattention-vllm

At this point the remaining bottleneck is memory fragmentation. Naive KV cache allocation pre-reserves a contiguous region of GPU memory per request, sized for the maximum context; a request that only ever fills 2K tokens of a 4K reservation wastes half of it. PagedAttention treats the KV cache like OS virtual memory, allocating small physical “pages” on demand and mapping logical token positions onto them through a per-request block table. vLLM implements this and achieves a 2–4x throughput improvement over naive serving, primarily by fitting more concurrent requests into the same GPU memory.
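
A toy page-allocator sketch (a hypothetical class, not vLLM's actual API): a physical page is taken from the free list only when a request's current page fills, and a logical token position resolves to (page, offset) like a page-table walk:

```python
class PagedKVCache:
    """Toy on-demand page allocator for KV cache slots (illustrative only)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free = list(range(num_pages))  # physical pages not yet in use
        self.tables = {}                    # request -> list of page ids
        self.lens = {}                      # request -> tokens stored

    def append(self, req):
        # Reserve one KV slot for a new token; grab a page only when needed.
        n = self.lens.get(req, 0)
        if n % self.page_size == 0:  # current page is full (or none yet)
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lens[req] = n + 1

    def locate(self, req, pos):
        # Logical position -> (physical page, offset within page).
        return self.tables[req][pos // self.page_size], pos % self.page_size

    def release(self, req):
        # Finished request: return its pages to the pool immediately.
        self.free.extend(self.tables.pop(req, []))
        self.lens.pop(req, None)
```

The point of the sketch: a 10-token request holds ceil(10/4) = 3 pages rather than a worst-case contiguous reservation, and freed pages are instantly reusable by other requests.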


Step 6 — Quantization

The final axis: reduce the numerical precision of weights (and optionally activations and the KV cache). Moving FP16 → INT8 → INT4 shrinks the memory footprint 2–4x with varying quality degradation, enabling larger batch sizes and lower latency. No wiki page yet; see the GPTQ and AWQ papers directly.
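
A minimal sketch of symmetric per-row INT8 weight quantization (the simplest possible scheme; function names are invented, and GPTQ/AWQ are considerably more sophisticated about which weights to protect):

```python
import numpy as np

def quantize_int8(W):
    # Symmetric per-row quantization: one FP32 scale per output row,
    # chosen so the row's largest magnitude maps to 127.
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation; rounding error is at most scale/2 per entry.
    return q.astype(np.float32) * scale
```

Storage drops from 2 bytes to 1 byte per weight plus one scale per row; the per-row (rather than per-tensor) scale keeps outlier rows from destroying the resolution of everything else.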