Inference efficiency is a stack of mostly independent optimizations, each addressing a different bottleneck. This path works from the most fundamental (eliminating redundant computation) up to system-level memory management, numerical approximations, and architectural choices that compress or dynamically route compute.
Step 1 — KV Cache
Start here. The KV cache is the baseline optimization that makes autoregressive generation tractable at all. Without it, generating token N requires recomputing the key and value projections, and the attention over them, for all N-1 previous tokens on every step: O(N²) attention work per step and O(N³) over the whole sequence. The KV cache stores the computed K and V tensors, so each decode step projects only the new token and attends once over the cached entries: O(N) per step and O(N²) in total. Many of the subsequent optimizations in this path either reduce cache memory or improve cache utilization.
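A minimal sketch of the caching idea, assuming a single attention head and a toy stand-in for the learned projections (`project`, `KVCache`, and `step_without_cache` are hypothetical names, not any library's API). It checks that the cached path returns the same output as full recomputation while projecting each token only once.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]

def project(x, scale):
    """Toy stand-in for a learned K or V projection (assumption: plain scaling)."""
    return [scale * xi for xi in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens projected so far

    def step(self, x):
        # Project only the new token, append it, then attend over the whole cache.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1
        return attend(x, self.keys, self.values)

def step_without_cache(tokens):
    """Recompute K and V for every token seen so far; returns (output, tokens projected)."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values), len(tokens)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
recomputed = 0
for n in range(1, len(tokens) + 1):
    cached_out = cache.step(tokens[n - 1])
    plain_out, cost = step_without_cache(tokens[:n])
    recomputed += cost
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, recomputed)  # 4 projections with the cache, 10 without
```

The gap between the two counters grows quadratically with sequence length, which is the redundancy the cache removes.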
Step 2 — Flash Attention
The KV cache solves the compute redundancy problem. Flash Attention solves the memory bandwidth problem. Standard attention materializes the full N×N attention matrix in HBM (the GPU's large but comparatively slow main memory). Flash Attention tiles the computation so that each block of Q, K, and V is processed in SRAM (small, fast on-chip memory) using an online softmax, and the N×N matrix is never written to HBM. The result is still exact attention, with far less memory traffic. The two optimizations are complementary: the KV cache matters most in the decode phase, Flash Attention in the prefill phase.
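The tiling relies on an online softmax: a running maximum and running normalizer let each block of keys and values be folded in without ever holding all scores at once. The sketch below illustrates only that numerical trick for a single query in plain Python; it is not the fused GPU kernel, it omits the usual 1/sqrt(d) scaling, and the function names and `block_size` parameter are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference: compute every score first, then one softmax over all of them."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size=2):
    """Process keys/values block by block, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
assert all(abs(a - b) < 1e-9 for a, b in zip(full, tiled))
```

Because the rescaling is exact, the tiled result matches the reference up to floating-point rounding, regardless of block size.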
Step 3 — Grouped Query Attention (GQA)
The KV cache grows with sequence length and with the number of KV heads. GQA reduces the number of KV heads while keeping the full number of query heads, with each group of query heads sharing one KV head. LLaMA-3 70B uses 8 KV heads instead of 64, reducing cache size 8x at a modest quality cost. This is an architectural choice baked in at training time, but it is motivated largely by inference-time memory constraints. Understand this before PagedAttention.
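The 8x figure follows directly from the cache-size formula. A rough calculation, assuming 80 layers, a head dimension of 128, FP16 storage, and an 8192-token context (these configuration values are assumptions for illustration; check them against the model's published config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (the leading 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed configuration, for illustration only.
LAYERS, HEAD_DIM, QUERY_HEADS, KV_HEADS = 80, 128, 64, 8

def kv_head_for_query(q_head):
    """Each group of QUERY_HEADS // KV_HEADS query heads shares one KV head."""
    return q_head // (QUERY_HEADS // KV_HEADS)

mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, seq_len=8192)
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=8192)
print(mha_bytes / 2**30, gqa_bytes / 2**30)  # 20.0 and 2.5 (GiB per sequence)
print(mha_bytes // gqa_bytes)                # 8
```

Under these assumptions a single 8K-token sequence drops from 20 GiB of cache to 2.5 GiB, which is the difference between serving one request and serving several on the same GPU.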
Step 4 — Mistral 7B
Mistral 7B was an early production model to combine GQA with sliding window attention (SWA), in which each token attends only to a fixed-size window of recent tokens, in a single architecture. The result: a 7B model that its authors report outperforming LLaMA 2 13B across the benchmarks they evaluated, while running faster at inference and fitting in less memory. Reading this after GQA makes the architectural motivation for both choices clear, and shows how inference constraints shape model design from the start.
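A small sketch of what a sliding window does to the attention mask and to the cache. It assumes the window counts the current token, and the rolling-slot helper illustrates the general idea of a fixed-size cache rather than Mistral's actual implementation; both function names are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (causal, last `window` tokens)."""
    return [[(i - window < j <= i) for j in range(seq_len)] for i in range(seq_len)]

def rolling_cache_slot(position, window):
    """With a bounded window, the cache needs only `window` slots, reused cyclically."""
    return position % window

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Each row has at most `window` allowed positions, so per-token attention cost and cache size stop growing once the sequence is longer than the window.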
Step 5 — Speculative Decoding
So far, every token still requires one full target-model forward pass. Speculative decoding breaks this: a small draft model proposes K tokens, and the large target model verifies all K in a single parallel pass. On typical text, 2–4 tokens are accepted per target call, though the rate depends on how closely the draft model tracks the target. Generation speeds up without changing output quality: the rejection sampling scheme provably preserves the target model's output distribution, so it is exact rather than an approximation.
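The exactness claim can be checked numerically for a single position. In the sketch below, `p` is the target distribution and `q` the draft distribution over a three-token vocabulary (toy values chosen for illustration). A drafted token x is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual max(0, p - q). The function names are hypothetical.

```python
import random

def residual_distribution(p, q):
    """Distribution used after a rejection: normalized max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p)

def sample_token(p, q, rng):
    """One accept/reject step for a single drafted token."""
    x = rng.choices(range(len(q)), weights=q)[0]   # draft model proposes x
    if rng.random() < min(1.0, p[x] / q[x]):       # target model accepts or rejects
        return x
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]

def output_distribution(p, q):
    """Exact distribution of sample_token's result, computed analytically."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]

p = [0.6, 0.3, 0.1]  # target model
q = [0.3, 0.5, 0.2]  # draft model
result = output_distribution(p, q)
assert all(abs(a - b) < 1e-9 for a, b in zip(result, p))
token = sample_token(p, q, random.Random(0))
```

The analytic check shows the combined accept-or-resample procedure reproduces `p` exactly, whatever the draft distribution is; a poor draft model only lowers the acceptance rate.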
Step 6 — PagedAttention / vLLM
At this point the remaining bottleneck is memory fragmentation. Naive KV cache allocation pre-reserves contiguous GPU memory for the maximum sequence length of every request: with a 4K maximum, a request that only reaches 2K tokens leaves half of its reservation unused. PagedAttention treats the KV cache like OS virtual memory, splitting it into fixed-size physical “pages” that are allocated on demand and need not be contiguous. vLLM implements this and reports 2–4x higher throughput than earlier serving systems at comparable latency, primarily by fitting more requests into the same GPU memory.
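A toy allocator makes the paging idea concrete. This is a sketch of the bookkeeping only, with hypothetical names (`PagedKVAllocator`, `append_token`, `release`); it is not vLLM's API and it stores no tensors.

```python
class PagedKVAllocator:
    """Maps each request to a list of fixed-size physical blocks, allocated on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> physical block ids, in logical order
        self.lengths = {}       # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens -> 1 block
blocks_in_use = sum(len(t) for t in alloc.block_tables.values())
print(blocks_in_use, len(alloc.free_blocks))  # 3 5
```

Reserving a 64-token maximum for each of the two requests up front would consume all 8 blocks; on-demand paging uses 3 and leaves 5 free for additional requests.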
Step 7 — Quantization
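Quantization reduces the numerical precision of weights and, optionally, of activations and the KV cache. Going from FP16 to INT8 to INT4 shrinks the memory footprint by 2x and 4x respectively, with quality degradation that tends to grow as precision drops. The freed memory allows larger batch sizes, and because decoding is usually bound by memory bandwidth, smaller weights also lower latency. Quantization is orthogonal to the previous optimizations: it addresses model size rather than attention compute or memory fragmentation.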
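As a concrete instance, here is a sketch of symmetric absmax INT8 quantization, one of the simplest schemes; production methods typically add per-channel or per-group scales and outlier handling. The function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input; any scale works
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [qv * scale for qv in quantized]

weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)  # [50, -127, 3, 100]
```

Each weight now occupies one byte instead of two, and the rounding error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts accuracy for every other weight in the group.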
Step 8 — Knowledge Distillation
Quantization compresses by reducing precision. Knowledge distillation compresses by reducing model size itself. A smaller student network is trained to match the soft probability outputs of a larger teacher network, not just the hard labels. The soft targets carry far more information per example than one-hot labels: the relative probabilities across classes encode what the teacher has learned about similarity between them. Distillation produces small models that outperform models of the same size trained from scratch on the same data.
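A sketch of a typical distillation objective for one example: a KL term between temperature-softened teacher and student distributions, blended with ordinary cross-entropy on the hard label. The temperature, the mixing weight `alpha`, and the T² scaling follow a common formulation, but the specific values and function names here are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    # Soft term: match the teacher's softened distribution (scaled by T^2 so its
    # gradient magnitude stays comparable across temperatures).
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    # Hard term: standard cross-entropy against the true label.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [4.0, 1.0, 0.5]
loss_matching = distillation_loss([4.0, 1.0, 0.5], teacher, label=0)
loss_mismatched = distillation_loss([0.5, 1.0, 4.0], teacher, label=0)
assert loss_matching < loss_mismatched
```

Raising the temperature flattens the teacher's distribution, exposing the small probabilities on wrong classes that carry the similarity information described above.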
Step 9 — QLoRA: Efficient Fine-Tuning of Quantized LLMs
QLoRA combines quantization with LoRA (low-rank adaptation) to fine-tune a 65B model on a single 48GB GPU. The base model is frozen and stored in 4-bit NormalFloat (NF4), a quantization format designed for normally distributed weights, and only the low-rank adapter weights are trained, in 16-bit precision, with gradients backpropagated through the quantized base. This brings fine-tuning of very large models within reach of a single GPU and has made QLoRA one of the most widely used approaches to parameter-efficient fine-tuning.
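Back-of-the-envelope arithmetic shows why the 48GB figure is plausible. The calculation ignores quantization constants, activations, and optimizer state, and the 8192×8192 layer shape and rank 64 are assumptions chosen for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

def lora_trainable_params(d_out, d_in, rank):
    """A LoRA adapter replaces a d_out x d_in update with B (d_out x r) times A (r x d_in)."""
    return rank * (d_in + d_out)

params = 65_000_000_000
print(weight_memory_gb(params, 16))  # 130.0 GB in 16-bit: does not fit in 48 GB
print(weight_memory_gb(params, 4))   # 32.5 GB in 4-bit: fits, with room for adapters

full = 8192 * 8192                                 # weights in one assumed projection matrix
adapter = lora_trainable_params(8192, 8192, rank=64)
print(full // adapter)                             # 64x fewer trainable parameters
```

The frozen 4-bit base accounts for almost all of the memory, while the adapters, their gradients, and their optimizer state are small enough to keep in higher precision.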
Step 10 — Mixture of Depths
All previous optimizations apply uniform effort: every token runs through every layer. Mixture of Depths challenges this. A learned router decides, per token per layer, whether to process the token through the full transformer block or to skip the block and pass the token along the residual connection unchanged. Intuitively, easy tokens (common words, repeated context) get less compute and hard tokens (rare words, complex reasoning) get more. This dynamic allocation aims to maintain quality while reducing average FLOPs per forward pass, an efficiency axis distinct from the memory and precision savings above.
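A sketch of the routing step for one layer, assuming top-k selection under a fixed per-layer capacity and assuming the block output is scaled by the router score before being added to the residual. The router scores, the toy `block` function, and the name `mixture_of_depths_layer` are all hypothetical; in a real model the scores come from a learned projection.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Run `block` only on the `capacity` highest-scoring tokens; the rest skip it."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:capacity])
    outputs = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Residual plus block output, weighted by the router score.
            outputs.append([xi + router_scores[i] * bi for xi, bi in zip(x, block(x))])
        else:
            outputs.append(list(x))  # skipped: residual stream passes through unchanged
    return outputs, selected

def block(x):
    """Toy stand-in for a transformer block (assumption: doubles each feature)."""
    return [2.0 * xi for xi in x]

tokens = [[1.0], [2.0], [3.0], [4.0]]
router_scores = [0.1, 0.9, 0.4, 0.8]
outputs, selected = mixture_of_depths_layer(tokens, router_scores, block, capacity=2)
print(sorted(selected))  # [1, 3]
```

With a capacity of 2 out of 4 tokens, the block runs half as often in this layer, and the compute budget is known in advance because the capacity is fixed.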