What It Is

Inference efficiency encompasses techniques to reduce the compute, memory, and latency cost of running large language models at serving time. Unlike training (a one-time cost), inference runs continuously at production scale — every token generated by every user is inference. Efficiency here determines whether serving a 70B model is economically viable or not.

Why It Matters

At scale, inference cost dominates LLM economics. OpenAI’s ChatGPT was estimated to cost ~$700K/day at peak 2023 usage. Efficient inference enables faster responses, lower cost per token, serving larger models on constrained hardware, and deploying models at the edge. Without inference optimization, frontier models would be economically unviable for consumer applications.

Latency vs. Throughput

These are distinct objectives that require different techniques:

  • Latency — time to first token, and time per subsequent token. Dominated by memory bandwidth (moving weights from HBM to SRAM on GPU). Users notice latency directly.
  • Throughput — tokens generated per second across all concurrent users. Dominated by compute utilization. Determines cost per token.

Optimizing for one can harm the other. Batching increases throughput but increases latency for early-arriving requests. Speculative decoding reduces latency but uses more total compute.

Serving tradeoff space:

High latency, high throughput   ←── Batching (vLLM, continuous batching)
Low latency, lower throughput   ←── Speculative decoding, small model
Memory-limited                  ←── Quantization, PagedAttention
Compute-limited                 ←── FlashAttention, GQA

Full Taxonomy of Techniques

1. KV Cache

During autoregressive generation, each new token attends to all previous tokens. Without caching, the key and value vectors for all prior tokens would be recomputed from scratch at each step. The KV cache stores these tensors in GPU memory after the first computation.

  • Cost: O(2 × n_layers × d_model × seq_len) elements of memory per request (the 2 covers K and V; multiply by bytes per element)
  • For a 70B model with 80 layers, d_model = 8192, and seq_len = 4096 in FP16, the KV cache is ≈ 10 GB per request (with full MHA; GQA shrinks this further)
  • This is why long-context serving is expensive — the cache grows linearly with sequence length
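A quick sanity check of the cost formula above, assuming FP16 (2 bytes per element) and full multi-head attention (the helper name is illustrative):

```python
# Hypothetical helper: KV cache size per request.
# 2 tensors (K and V) per layer, each of shape [seq_len, d_model].
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_elem=2):
    return 2 * n_layers * d_model * seq_len * bytes_per_elem

# 70B-class model: 80 layers, d_model = 8192, 4096-token context, FP16
gib = kv_cache_bytes(80, 8192, 4096) / 2**30
print(f"{gib:.1f} GiB per request")  # → 10.0 GiB per request
```

With GQA (8 KV heads instead of 64, as in LLaMA-2-70B), the same cache drops by a further factor of 8, to about 1.3 GiB per request.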

See kv-cache for full details.

2. FlashAttention

The standard attention implementation materializes the full n×n attention matrix in HBM (GPU high-bandwidth memory). For n=2048, this is 2048² × 2 bytes = 8 MB per head — and HBM bandwidth is the bottleneck, not compute.

FlashAttention tiles the attention computation into blocks that fit in SRAM (the fast on-chip cache), performing the softmax and value aggregation without materializing the full matrix. Result: 2-4× speedup and O(n) memory instead of O(n²).

FlashAttention-2 adds better work partitioning across GPU warps, achieving ~70% GPU utilization vs. ~35% for standard attention.
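The tiling idea rests on the "online softmax" identity: a softmax can be computed tile by tile if you carry a running max, running denominator, and running output, rescaling earlier partial results whenever a new maximum appears. A minimal numpy sketch for a single query vector (illustrative names, not the actual kernel API):

```python
import numpy as np

# Online-softmax attention for one query row: process K/V in tiles and never
# materialize the full score vector, mirroring FlashAttention's SRAM tiling.
def attention_row_tiled(q, K, V, block=64):
    m, l = -np.inf, 0.0                  # running max and softmax denominator
    acc = np.zeros(V.shape[1])           # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q                       # attention scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l                       # O(block) working memory, not O(n)

# Agrees with naive attention that materializes all scores at once
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(256, 64)), rng.normal(size=(256, 32))
s = K @ q
w = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.allclose(attention_row_tiled(q, K, V), w @ V)
```

The real kernel additionally tiles the query dimension and fuses the matmuls, but the rescaling trick above is what makes O(n) memory possible.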

See flash-attention for full details.

3. Speculative Decoding

Autoregressive generation is sequential — you can’t generate token 5 before token 4. But verification is parallel: given a complete sequence, you can score all tokens simultaneously.

Speculative decoding exploits this asymmetry:

  1. A small draft model (e.g. 7B) generates K tokens autoregressively (fast and cheap)
  2. The large verifier model (e.g. 70B) evaluates all K+1 positions in a single forward pass
  3. Accept tokens where the draft was correct; reject and resample from the first mismatch
  4. Net result: between 1 and K+1 tokens accepted per large-model forward pass (vs. 1 in vanilla decoding)

Draft model generates:  "The capital of France is [Paris]"  (5 tokens, fast)
                                                                ↑
Verifier checks all:    Accept "Paris"; reject if wrong → resample from verifier
Speedup:                2-3× if the draft acceptance rate is ~80%

The key insight: verification is embarrassingly parallel, so one large-model forward pass can validate many sequentially drafted tokens at once. The technique only pays off when the draft model frequently agrees with the verifier.
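The draft-then-verify loop can be sketched with toy next-token functions (all names here are hypothetical; real systems sample from distributions and use a probabilistic rejection rule, and the verifier's K+1 scores come from one batched forward pass):

```python
# Greedy-acceptance sketch of one speculative decoding step.
def speculative_step(prefix, draft_next, verifier_next, k=4):
    # 1. draft model proposes k tokens autoregressively (cheap, sequential)
    ctx, proposed = list(prefix), []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. verifier's target token at each of the k+1 positions; in a real
    #    system this is ONE forward pass over the whole drafted sequence
    full = list(prefix) + proposed
    targets = [verifier_next(full[:len(prefix) + i]) for i in range(k + 1)]
    # 3. accept draft tokens until the first disagreement, then take the
    #    verifier's own token at that position
    out = []
    for i, t in enumerate(proposed):
        if t != targets[i]:
            out.append(targets[i])      # resample from verifier at mismatch
            return out
        out.append(t)
    out.append(targets[k])              # all k accepted: free bonus token
    return out                          # between 1 and k+1 tokens per pass

# Toy "models" over integer tokens: they agree on short contexts, then diverge
draft = lambda seq: (sum(seq) + 1) % 7
verifier = lambda seq: (sum(seq) + 1) % 7 if len(seq) < 4 else (sum(seq) + 2) % 7
print(speculative_step([1], draft, verifier, k=4))  # → [2, 4, 1, 3]
```

Here the draft's first three tokens match the verifier, so a single verifier pass emits four tokens instead of one.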

See speculative-decoding for full details.

4. Continuous Batching

Traditional batching waits for a batch of requests to arrive, processes all together, waits for all to finish. Requests that complete early leave GPU capacity idle while waiting for longer-running requests in the same batch.

Continuous batching (iteration-level scheduling) schedules at the token level: when a sequence in the batch reaches its end-of-sequence token, remove it immediately and insert a new waiting request. The batch is dynamically updated each forward pass. vLLM implements this with PagedAttention to manage variable-length KV caches.

Result: GPU utilization improves from ~60% (static batching) to ~90%+.
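The scheduling difference can be seen in a toy simulation where each request needs a fixed number of decode steps and each iteration of the loop is one forward pass (a sketch, not vLLM's scheduler):

```python
from collections import deque

# Iteration-level scheduling: finished sequences free their batch slot
# immediately, and waiting requests are admitted before every forward pass.
def continuous_batching(request_lengths, max_batch=4):
    waiting = deque(enumerate(request_lengths))
    active = {}                      # request id -> remaining tokens
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            rid, length = waiting.popleft()
            active[rid] = length     # admit into a free slot
        steps += 1                   # one forward pass: one token per active seq
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]      # slot freed now, not at end of batch
    return steps

print(continuous_batching([2, 8, 2, 8, 2, 8]))  # → 10
```

Static batching on the same workload would need 16 passes (two batches, each gated on its longest request of 8 tokens); continuous batching finishes in 10.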

5. PagedAttention (vLLM)

The KV cache for different sequences has different lengths, which makes contiguous memory allocation wasteful. Traditional implementations pre-allocate space for the maximum possible sequence length per request, wasting up to 60-80% of reserved memory on internal fragmentation.

PagedAttention maps KV cache blocks to non-contiguous physical memory pages (like virtual memory in OSes). Sequences share physical pages when prompt prefixes overlap. Memory waste drops to ~4% (only the last partial page per sequence).

Result: 2-4× more sequences can run simultaneously, directly improving throughput.
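The core data structure is a block table per sequence mapping logical cache blocks to physical pages, with pages allocated only as the sequence actually grows. A minimal sketch (class and method names are illustrative, not vLLM's API):

```python
# Paged KV-cache allocator: fixed-size pages drawn from a shared pool.
class PagedKVCache:
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # physical page pool
        self.block_tables = {}               # seq id -> list of page ids
        self.lengths = {}                    # seq id -> tokens stored

    def append_token(self, seq):
        table = self.block_tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n % self.page_size == 0:          # current page full (or none yet)
            table.append(self.free.pop())    # grab any free physical page
        self.lengths[seq] = n + 1

    def release(self, seq):
        self.free.extend(self.block_tables.pop(seq))  # pages reusable at once
        del self.lengths[seq]

cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(20):
    cache.append_token("req-A")
print(len(cache.block_tables["req-A"]))  # → 2
```

A contiguous allocator sized for a 4096-token maximum would have reserved 256 pages for this request up front; here only the 2 pages actually touched are held, and at most the last page is partially empty.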

6. Quantization

Model weights are stored as FP16 (2 bytes each) by default. Quantization represents weights in lower precision:

Format          Bytes/weight   Memory (70B model)   Typical quality loss
FP16            2              140 GB               baseline
INT8 (W8A16)    1              70 GB                <1% on most benchmarks
INT4 (W4A16)    0.5            35 GB                1-3% on reasoning tasks
FP8             1              70 GB                <0.5% (with hardware support)

Weights-only quantization (W4A16 = 4-bit weights, 16-bit activations) is the dominant technique: quantize weights for memory savings, dequantize to FP16 for computation. GPTQ and AWQ are the standard algorithms.

Quantization is primarily a memory-bandwidth optimization: a 70B model in INT4 fits on a single 40GB A100, vs. requiring two 80GB A100s at FP16.
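The core round-trip behind weights-only quantization is simple; GPTQ and AWQ add error-compensating tricks on top. A minimal symmetric per-group INT4 sketch (real kernels pack two 4-bit weights per byte; int8 storage here is for clarity):

```python
import numpy as np

# Symmetric per-group quantization: one FP scale per group of 128 weights,
# integer codes in the INT4 range.
def quantize_int4(w, group=128):
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7   # map max |w| to ±7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)   # back to FP for the actual matmul (W4A16)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by half a scale step
```

The rounding error per weight is at most scale/2, which is why per-group (rather than per-tensor) scales matter: outlier weights only inflate the error of their own group.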

7. Grouped-Query Attention (GQA)

Standard multi-head attention (MHA) uses one KV head per query head (H_kv = H_q), all stored in the KV cache. Multi-query attention (MQA) shares a single KV head across all Q heads, reducing cache size by a factor of H_q but degrading quality.

GQA is the middle ground: group Q heads into G groups, each sharing one KV head. LLaMA-2 (70B) uses 8 KV heads for 64 query heads. KV cache shrinks 8×. Quality gap from MHA is negligible.
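Mechanically, each cached KV head is simply reused by a contiguous group of query heads at attention time, so only the small set of KV heads is ever stored. A sketch with LLaMA-2-70B's layout (toy sequence length; the expand function is illustrative):

```python
import numpy as np

# GQA: expand n_kv_heads cached heads to n_q_heads views for the attention op.
def expand_kv(kv, n_q_heads):
    # kv: [n_kv_heads, seq, d_head] -> [n_q_heads, seq, d_head]
    return np.repeat(kv, n_q_heads // kv.shape[0], axis=0)

K = np.zeros((8, 16, 128))        # 8 KV heads actually cached
K_full = expand_kv(K, 64)         # viewed as 64 heads by the query heads
print(K_full.shape)               # → (64, 16, 128)

# Cache shrinks by the grouping factor: 8 KV heads stored instead of 64
mha_bytes = 2 * 80 * 64 * 4096 * 128 * 2   # K+V, 80 layers, 4096 ctx, FP16
gqa_bytes = 2 * 80 * 8 * 4096 * 128 * 2
print(mha_bytes // gqa_bytes)              # → 8
```

In practice the expansion is a zero-copy broadcast inside the attention kernel, so the 8× saving applies to both cache memory and the bandwidth of reading it.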

8. Tensor Parallelism + Pipeline Parallelism

For models too large for one GPU:

  • Tensor parallelism: Split weight matrices across GPUs; each GPU computes a slice of every layer. Requires all-reduce at each layer boundary. Good for latency (parallel), bad for small batches (all-reduce overhead).
  • Pipeline parallelism: Split layers across GPUs; GPUs process different micro-batches. Hides inter-GPU latency with pipelining. Better throughput, worse latency.
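The tensor-parallel pattern for an MLP block can be shown with array slices standing in for two GPUs: column-split the first weight matrix, row-split the second, and the sum at the end is the all-reduce at the layer boundary (a sketch, not a distributed implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 512))            # activations replicated on each "GPU"
W1 = rng.normal(size=(512, 2048))
W2 = rng.normal(size=(2048, 512))

# GPU i holds a column slice of W1 and the matching row slice of W2,
# so the ReLU can be applied locally with no communication.
h0 = np.maximum(x @ W1[:, :1024], 0)     # GPU 0
h1 = np.maximum(x @ W1[:, 1024:], 0)     # GPU 1
partial0 = h0 @ W2[:1024]                # each GPU's partial output
partial1 = h1 @ W2[1024:]
y = partial0 + partial1                  # the all-reduce (sum) across GPUs

y_ref = np.maximum(x @ W1, 0) @ W2       # single-device reference
assert np.allclose(y, y_ref)
```

This column-then-row split is why one all-reduce per block suffices: the elementwise ReLU falls between the two splits and never crosses the shard boundary.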

When to Use Each Technique

Constraint                                 Technique
Memory-limited (can’t fit model)           INT4 quantization, GQA
Latency-sensitive (chatbot)                Speculative decoding, FlashAttention
Throughput-sensitive (batch processing)    Continuous batching, PagedAttention
Long sequences (>4K tokens)                PagedAttention, FlashAttention-2
Multi-GPU serving                          Tensor parallelism + continuous batching

Key Sources

  • flash-attention — IO-aware attention kernel
  • kv-cache — the memory bottleneck driving most inference optimizations
  • speculative-decoding — latency reduction via draft-verify parallelism
  • transformer — the architecture all these techniques optimize
  • ssm-mamba — linear-time alternative that avoids the quadratic attention bottleneck

Open Questions

  • Optimal speculative decoding: how to match draft model to verifier without expensive search
  • FP8 vs INT4 at scale: when activation quantization is safe
  • Whether SSMs (Mamba) can fully replace KV-cached Transformers for long-context serving
  • Serving under heterogeneous hardware constraints (consumer GPUs, NPUs, mobile)