Stub — full ingest pending.

Raposo et al. (2024) introduce Mixture-of-Depths (MoD): a learned routing mechanism that decides, per token per layer, whether to process it through the full transformer block or skip it via a residual pass-through. A top-k (expert-choice) router selects which tokens each routed block processes, so the per-block capacity is fixed in advance and the compute graph stays static, but compute is dynamically allocated across tokens — harder tokens pass through more blocks, easier tokens fewer. In the best-performing configuration, routing is applied at every other block with a capacity of 12.5% of the sequence. MoD models match dense baseline quality at equivalent training FLOPs while requiring a fraction of the FLOPs per forward pass, with no dynamic shapes to complicate batched inference.
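A minimal sketch of the routing idea, in numpy. Names (`mod_block`, `router_w`, `block_fn`) are mine, not the paper's, and the router here is a bare linear scorer; the key properties it illustrates are real MoD mechanics: a fixed top-k capacity (static compute graph), residual pass-through for unselected tokens, and the block output scaled by the router score so the routing decision stays on the gradient path.

```python
import numpy as np

def mod_block(x, router_w, block_fn, capacity=0.125):
    """One MoD-style routed block (illustrative sketch, not the paper's code).

    x         : (seq, d) token activations
    router_w  : (d,) linear router weights producing one score per token
    block_fn  : the transformer block applied only to the selected tokens
    capacity  : fraction of tokens the block processes (fixed ahead of time)
    """
    seq, d = x.shape
    k = max(1, int(seq * capacity))        # static top-k budget per block
    scores = x @ router_w                  # (seq,) router score per token
    top = np.argsort(scores)[-k:]          # indices of the k highest scorers
    out = x.copy()                         # unselected tokens: residual pass-through
    # Selected tokens get the block output, weighted by their router score so
    # the routing decision is differentiable in a real (autograd) setting.
    out[top] = x[top] + scores[top, None] * block_fn(x[top])
    return out
```

Note that because k is fixed per block, the FLOPs per forward pass are known at compile time regardless of which tokens are chosen — this is what distinguishes MoD from early-exit schemes with data-dependent compute.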

Key claim: Token-level dynamic depth routing matches dense baseline quality at significantly lower average FLOPs per forward pass, with a fixed total compute budget, a static compute graph, and no added inference latency.