Your GPU can do trillions of arithmetic operations per second. While generating tokens from a large language model, it uses maybe 5% of that capacity. The other 95% sits idle. At every token-generation step, all that hardware is waiting — not for math, but for memory. To generate one token, the model reads its entire weight matrix from memory: tens of gigabytes traveling from DRAM to compute cores. One token. One full memory read. Then another. Then another.

Tokens are generated one at a time because each depends on all previous ones. You can’t predict token 47 until you know token 46. That causal dependency feels like a wall. But it isn’t — not quite. Speculative decoding finds the gap and walks through it.

The core mechanism

Think of how a senior lawyer reviews a contract. A junior associate drafts the whole thing first — fast, cheap, mostly right. The senior reads the draft in one sitting, marks the errors, and approves the clean sections. Total time: far less than if the senior had composed every sentence themselves. The junior’s draft might be wrong in places, but the senior’s review catches everything and the final output is identical to what the senior would have written alone.

Speculative decoding works exactly the same way:

  1. A small, fast “draft” model (say, 77M parameters) generates γ candidate tokens autoregressively. This is cheap — the small model fits in fast memory, and its weights are tiny.
  2. The large target model (say, 11B parameters) takes all γ draft tokens and processes them in one parallel forward pass. It reads its weights from memory exactly once, not γ times. This is the key: the big model can evaluate all γ positions simultaneously because attention is parallel over sequence positions.
  3. For each draft token, compare what the big model would have predicted vs. what the small model actually predicted. Accept matches. Use a correction step for mismatches. Always produce at least one new token.
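The three steps above can be sketched in a few dozen lines. This is a toy illustration, not the paper's implementation: `draft_probs` and `target_probs_batch` are hypothetical stand-ins for real model forward passes, and the vocabulary is tiny.

```python
# Toy sketch of one speculative-decoding step. `draft_probs` and
# `target_probs_batch` are hypothetical stand-ins for model calls.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_probs(ctx):
    """Small model: next-token distribution q(x | ctx). Toy stand-in."""
    logits = np.sin(np.arange(VOCAB) + len(ctx))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs_batch(ctx, drafts):
    """Big model: ONE parallel pass returns p(x | ctx + drafts[:i])
    for every position i = 0..len(drafts), i.e. gamma+1 distributions."""
    out = []
    for i in range(len(drafts) + 1):
        logits = np.cos(np.arange(VOCAB) * 0.7 + len(ctx) + i)
        e = np.exp(logits - logits.max())
        out.append(e / e.sum())
    return out

def speculative_step(ctx, gamma=4):
    # 1. Draft gamma tokens autoregressively with the small model.
    drafts, qs = [], []
    for _ in range(gamma):
        q = draft_probs(ctx + drafts)
        drafts.append(int(rng.choice(VOCAB, p=q)))
        qs.append(q)
    # 2. One parallel verification pass of the big model.
    ps = target_probs_batch(ctx, drafts)
    # 3. Accept/reject each draft token, left to right.
    accepted = []
    for x, q, p in zip(drafts, qs, ps):
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)                   # big model agrees: keep
        else:
            resid = np.maximum(p - q, 0.0)       # residual distribution
            resid /= resid.sum()
            accepted.append(int(rng.choice(VOCAB, p=resid)))
            return accepted                      # stop at first rejection
    # All gamma accepted: sample one bonus token from the final p.
    accepted.append(int(rng.choice(VOCAB, p=ps[gamma])))
    return accepted

new_tokens = speculative_step([1, 2, 3])
print(len(new_tokens))  # always >= 1, at most gamma + 1
```

Note that a rejection ends the step: tokens after the first mismatch were drafted against a context the big model has now overruled, so they are discarded.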

The subtle genius is step 3. You might think: “accepting only the tokens that match means we just throw away the mismatches and get nothing.” But the correction step is smarter — it produces a token drawn from what the big model wanted, so every pass produces at minimum 1 token (the corrected one) and often produces several.

More importantly: the acceptance-rejection rule is designed so that the final output has exactly the same distribution as if you’d run the big model alone the whole time. Not approximately. Exactly.

“Speculative sampling, and therefore speculative decoding, guarantee an identical output distribution for any choice of approximation model Mq without restriction.”

Translation: this isn’t a quality-speed tradeoff. It’s free speed. The outputs are mathematically indistinguishable from running the big model normally. A unit test comparing outputs character-by-character would pass.

WITHOUT speculative decoding:
  Token 43 → [Big model reads 11B weights] → "The"
  Token 44 → [Big model reads 11B weights] → "cat"
  Token 45 → [Big model reads 11B weights] → "sat"
  Token 46 → [Big model reads 11B weights] → "on"
  = 4 full memory reads of 11B weights

WITH speculative decoding (γ=4):
  Draft phase (small model):
    Token 43' → [77M weights] → "The"
    Token 44' → [77M weights] → "cat"  
    Token 45' → [77M weights] → "sat"
    Token 46' → [77M weights] → "on"
  
  Verify phase (ONE parallel pass of big model):
    [Big model reads 11B weights ONCE, evaluates all 4 positions]
    "The" → p ≥ q → ACCEPT ✓
    "cat" → p ≥ q → ACCEPT ✓
    "sat" → p ≥ q → ACCEPT ✓
    "on"  → p < q → weighted coin flip fails → REJECT,
            resample from p'(x) = norm(max(0, p(x)-q(x)))
  
  Result: 3 tokens accepted + 1 corrected = 4 tokens, only 1 big-model forward pass
  Speedup: 4x fewer big-model passes (minus small model overhead ≈ 3.4x net)

The sampling rule that makes this exact (not approximate):

“To sample x ~ p(x), we instead sample x ~ q(x), keeping it if q(x) ≤ p(x), and in case q(x) > p(x) we reject the sample with probability 1 − p(x)/q(x) and sample x again from an adjusted distribution p’(x) = norm(max(0, p(x) − q(x))) instead.”

Translation in plain English: sample a token x from the small model. If the big model likes that token at least as much as the small model does — keep it. If the big model is more skeptical — flip a weighted coin. Heads: keep it anyway. Tails: throw it away and draw from the residual distribution — the stuff the big model wanted that the small model underweighted. The residual ensures you always get a sample that looks like it came from p, not q.
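The rule is short enough to check empirically. The sketch below uses two hand-made toy distributions p (target) and q (draft); sampling repeatedly through the rule recovers p, not q:

```python
# Minimal sketch of the acceptance-rejection rule for one token,
# with hand-made toy distributions p (target) and q (draft).
import numpy as np

rng = np.random.default_rng(42)
p = np.array([0.5, 0.3, 0.1, 0.1])      # what the big model wants
q = np.array([0.25, 0.25, 0.25, 0.25])  # what the small model proposes

def speculative_sample(p, q):
    x = rng.choice(len(q), p=q)              # draw from the draft model
    if rng.random() < min(1.0, p[x] / q[x]):
        return int(x)                        # accepted
    resid = np.maximum(p - q, 0.0)           # residual: what q underweighted
    resid /= resid.sum()
    return int(rng.choice(len(p), p=resid))  # corrected token

# Empirically, the output distribution matches p, not q.
counts = np.bincount([speculative_sample(p, q) for _ in range(200_000)],
                     minlength=4)
print(counts / counts.sum())  # ≈ [0.5, 0.3, 0.1, 0.1]
```

Working through this example by hand: tokens 0 and 1 are always kept (p ≥ q), tokens 2 and 3 survive the coin flip with probability 0.1/0.25, and the residual mass routes rejected samples back toward tokens 0 and 1 in exactly the proportion needed to land on p.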

Walkthrough with actual numbers:

γ = 5 draft tokens. α = 0.8 (the small model agrees with the big model 80% of the time, on average).

Expected number of tokens generated per big-model call (accepted drafts plus the guaranteed correction or bonus token; the geometric series 1 + α + α² + … + α^γ):

E[tokens] = (1 - α^(γ+1)) / (1 - α)
          = (1 - 0.8^6) / (1 - 0.8)
          = (1 - 0.262) / 0.20
          = 0.738 / 0.20
          = 3.69 tokens per call

Compared to baseline: 1 token per big-model call.

If the small model costs c = 0.02 of the big model’s time (T5-Small at 77M vs T5-XXL at 11B), the actual speedup:

Actual speedup = 3.69 / (1 + c × γ)
               = 3.69 / (1 + 0.02 × 5)
               = 3.69 / 1.10
               ≈ 3.35x
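A quick check of the arithmetic above:

```python
# Reproducing the worked example: alpha = 0.8, gamma = 5, c = 0.02.
alpha, gamma, c = 0.8, 5, 0.02

# Expected tokens per big-model call: geometric series sum.
expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Net speedup after paying for gamma small-model steps per call.
speedup = expected_tokens / (1 + c * gamma)

print(round(expected_tokens, 2))  # 3.69
print(round(speedup, 2))          # 3.35
```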

The paper demonstrates this on a live demo: a 38-token sentence generated with only 9 serial runs of the target model — versus 38 runs with standard decoding. That’s a 4.2x reduction in big-model calls.

α (acceptance rate)   Expected tokens/call   Speedup (c=0.02, γ=5)
0.5                   1.97                   1.79x
0.7                   2.94                   2.67x
0.8                   3.69                   3.35x
0.9                   4.69                   4.26x

Even α = 0.5 gives 1.79x. The gains are real even when the draft model is mediocre.

“even trivial unigram and bigram approximations yield non negligible α values. For example, for the case of English to German translation, the bigram model has an α value of 0.2, and since c = 0 in this case, yields a 1.25X speed improvement, which is surprisingly high for this trivial approximation model.”

Translation: you don’t need a brilliant draft model. A bigram model — one that predicts the next token from just the previous word, with no attention and no learned representations — still gives a 1.25x speedup.

What’s clever — the instinct:

The standard framing of inference is: tokens are sequential, therefore inference is sequential, therefore you can’t parallelize it. The brilliant observation is that this is only true for the big model. The small model can still run sequentially — it’s cheap. But the big model is memory-bandwidth-bound. Its forward pass can handle any batch size for free (up to hardware limits) because arithmetic is cheap; what’s expensive is the single memory read of weights. So: draft sequentially with the small model (costs almost nothing), then exploit the big model’s parallelism for verification (costs one memory read instead of γ).

The insight is that the causal constraint operates at the level of token correctness, not token evaluation. The big model can evaluate all γ positions in parallel because it has access to the full context (including previous draft tokens). The causal constraint says you can’t commit to a token until you’ve verified it — but you can check all of them at once.
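A toy sketch of why verification costs one pass: a causal model's forward pass already yields a next-token distribution at every input position, so feeding in context + drafts returns everything needed to judge all γ drafts at once. The "model" here is a deliberately trivial stand-in (an embedding table, a causal prefix-mean in place of attention, one matmul), not a transformer:

```python
# Sketch: one forward pass produces next-token distributions at ALL
# positions. Toy causal "model": embedding table + prefix-mean + matmul.
import numpy as np

rng = np.random.default_rng(1)
VOCAB, DIM = 8, 16
embed = rng.standard_normal((VOCAB, DIM))
W = rng.standard_normal((DIM, VOCAB))

def forward(tokens):
    h = embed[tokens]                                 # (n, DIM)
    # Causal prefix-mean: position i sees only tokens 0..i.
    h = np.cumsum(h, axis=0) / np.arange(1, len(tokens) + 1)[:, None]
    logits = h @ W                                    # one matmul, n positions
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)           # softmax per row

ctx = [1, 2]
drafts = [3, 4, 5, 6]             # gamma = 4 draft tokens
probs = forward(ctx + drafts)     # ONE pass over all 6 positions
# probs[i] is p(next token | first i+1 tokens): enough to verify all
# 4 drafts and sample a bonus token, from a single weight read.
print(probs.shape)  # (6, 8)
```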

“inference from large models is often not bottlenecked on arithmetic operations, but rather on memory bandwidth and communication, so additional computation resources might be available.”

Translation: every server running a large LLM has spare arithmetic capacity sitting idle during inference. Speculative decoding is essentially a way to trade that idle compute for fewer memory reads.

Does it actually work?

Setup                             Task                           Speedup vs standard decoding
T5-Small (77M) → T5-XXL (11B)     En→De translation, greedy      3.4x
T5-Small (77M) → T5-XXL (11B)     Summarization, greedy          3.1x
T5-Small (77M) → T5-XXL (11B)     En→De translation, stochastic  2.6x
Bigram model → T5-XXL             En→De translation              1.25x (trivial model!)

All results with identical outputs verified against baseline.

What doesn’t work:

Compute-bound hardware kills the speedup. If your GPU is already running arithmetic at 90% utilization — because you’re batching many concurrent users — there’s no spare compute for the verification pass. Speculative decoding’s gains assume you’re bandwidth-bound, not compute-bound. In high-throughput serving (large batch sizes), the gains shrink toward zero.

High-temperature sampling hurts acceptance rates. Creative generation with temperature = 1.0 or higher makes the small and large model disagree more often (lower α), reducing the speedup. The paper shows greedy decoding gets 3.4x; stochastic sampling gets 2.6x.

Two models to maintain. You now need a matched small-large pair. If the small model uses a different tokenizer than the large model, the acceptance rule breaks down entirely (the token distributions are incomparable).

Fixed γ is suboptimal. The number of draft tokens γ should ideally adapt based on the current acceptance rate — if α is high, draft more; if α is low, draft fewer. The paper uses a fixed γ and acknowledges that dynamic γ could yield “an additional 40-60% increase in performance.”
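A hedged sketch of what a dynamic-γ controller might look like (this heuristic is not from the paper, which uses a fixed γ): track a running estimate of the acceptance rate and grow or shrink the draft window accordingly.

```python
# Hypothetical dynamic-gamma heuristic (illustration only, not the
# paper's method): an EMA of the acceptance rate drives gamma up/down.
class GammaController:
    def __init__(self, gamma=4, lo=1, hi=16, decay=0.5):
        self.gamma, self.lo, self.hi, self.decay = gamma, lo, hi, decay
        self.alpha = 0.5  # running estimate of the acceptance rate

    def update(self, n_accepted):
        # Rate for the window we just drafted (computed before gamma moves).
        rate = n_accepted / self.gamma
        self.alpha = self.decay * self.alpha + (1 - self.decay) * rate
        if self.alpha > 0.8:
            self.gamma = min(self.hi, self.gamma + 1)  # drafts landing: draft more
        elif self.alpha < 0.4:
            self.gamma = max(self.lo, self.gamma - 1)  # drafts wasted: draft fewer
        return self.gamma

ctrl = GammaController()
for n_accepted in [4, 4, 4, 4, 4]:  # 4 drafts accepted at each step
    gamma = ctrl.update(n_accepted)
print(gamma)  # grew above the initial gamma of 4
```

The thresholds, bounds, and EMA decay here are arbitrary placeholder values; a production controller would tune them (or solve for the γ that maximizes the expected-speedup formula given the current α estimate).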

So what?

If you’re building LLM inference systems, speculative decoding is now a standard component. It’s available in vLLM, HuggingFace’s TGI, and most production serving stacks. Best conditions for using it: single-user or low-batch serving (you need spare compute), a compatible smaller model in the same family (same tokenizer is mandatory, similar architecture preferred), and sequences longer than ~50 tokens (overhead amortizes). You retrain nothing. You change no outputs. You get 2-3x faster responses.

The technique connects directly to the memory-bandwidth insight that FlashAttention also exploits — the observation that GPU arithmetic is abundant and memory reads are the real constraint. FlashAttention attacks this within a single forward pass by reordering operations to minimize memory traffic. Speculative decoding attacks it across time by batching multiple token positions into a single pass. Both papers are fundamentally about the same bottleneck, approached from different angles.

Use a tiny model to draft a few tokens, let the big model verify them all at once — same outputs, 3x faster, and it’s the idle arithmetic that pays for it.

Connections

  • speculative-decoding — the technique introduced
  • kv-cache — KV cache management interacts with speculative decoding
  • flash-attention — shares the memory-bandwidth-is-the-bottleneck insight
  • attention — the operation whose parallelism enables verification batching

Citation

arXiv:2211.17192

Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast Inference from Transformers via Speculative Decoding. ICML 2023 Oral. https://arxiv.org/abs/2211.17192