What It Is
Dynamic computation refers to neural network architectures where the amount of compute spent on a given input is not fixed at design time but varies based on the input’s content. Instead of every token, image patch, or example receiving the same number of operations, the network learns to allocate more compute to “hard” inputs and less to “easy” ones — spending the total budget where it matters most.
Why It Matters
Uniform compute is a structural inefficiency: a 24-layer language model applies all 24 layers to “the” just as it does to the word that changes the sentence’s meaning. Dynamic computation breaks this coupling. The total FLOP budget can remain fixed (and thus hardware-friendly), but its allocation becomes input-sensitive. The result is either the same quality at lower average cost, or better quality at the same cost.
How It Works
The core challenge is making dynamic decisions compatible with hardware that demands static computation graphs and fixed tensor sizes. Approaches differ in what they make dynamic:
- Dynamic depth (Mixture of Depths): A router decides, per token and per layer, whether the token is processed by the full block or skips it via the residual connection. The number of tokens processed per block is fixed in advance (top-k), so tensor shapes stay static.
- Dynamic width (Mixture of Experts): A router decides which expert FFN to activate for each token. The number of tokens each expert accepts is capped by a capacity constraint, so per-expert tensor shapes stay fixed.
- Early exit: A classifier decides when to stop computing (exiting at layer N instead of running all layers). Because different inputs exit at different depths, the computation graph tends to vary per input, which is harder for hardware to batch efficiently.
- Adaptive computation time (ACT): A learned halting mechanism decides how many computation steps an RNN spends on each input before stopping. General in principle, but hardware-unfriendly.
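The dynamic-depth idea from the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mixture of Depths implementation: each token's hidden state is a single float, plain lists stand in for tensors, and the names `route_top_k`, `routed_block`, and `block_fn` are hypothetical. The point it shows is that the block always receives exactly `k` tokens, whichever tokens the router picks.

```python
def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring tokens, in sequence order."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:k])


def routed_block(tokens, router_scores, k, block_fn):
    """Apply block_fn to exactly k tokens; all other tokens skip the block.

    tokens        -- list of floats (stand-ins for hidden states)
    router_scores -- one score per token (assumed to come from a learned router)
    block_fn      -- hypothetical stand-in for the attention + MLP block
    """
    selected = route_top_k(router_scores, k)
    # Gather step: the block input always has length k, whatever the content.
    gathered = [tokens[i] for i in selected]
    updates = block_fn(gathered)

    out = list(tokens)  # skipped tokens keep their value (residual path only)
    for i, update in zip(selected, updates):
        # Selected tokens get a residual update, scaled by their router score.
        out[i] = tokens[i] + router_scores[i] * update
    return out, selected


# Usage: 4 tokens, a budget of k=2 tokens for this block.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]
out, selected = routed_block(tokens, scores, 2, lambda xs: [2.0 * x for x in xs])
print(selected)  # → [1, 3]
```

Changing `scores` changes which tokens are processed, but never how many, which is what keeps the block's input shape known ahead of time.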
The static-budget variants (MoD, MoE) are the most practically successful because GPU/TPU utilization depends on predictable tensor shapes. The Mixture of Depths source makes this point directly: “Unlike other conditional computation techniques, [MoD] uses a static computation graph with known tensor sizes.”
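The capacity constraint that keeps expert routing static can be sketched in the same spirit. The helper names `expert_capacity` and `dispatch_with_capacity` are hypothetical, rounding the capacity up is an assumption, and the sketch assumes a top-1 router has already produced one expert choice per token. Each expert gets a fixed-length buffer, padded with `None` when underfull, and tokens that arrive after the buffer is full overflow.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Slots per expert: an even share of the tokens, scaled by a slack factor."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def dispatch_with_capacity(expert_choice, num_experts, capacity):
    """Place each token index in its chosen expert's fixed-size buffer.

    expert_choice -- expert id per token (assumed output of a top-1 router)
    Returns (buffers, dropped): buffers[e] always has length `capacity`,
    and `dropped` lists the token indices that overflowed.
    """
    buffers = [[None] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts
    dropped = []
    for token_idx, expert in enumerate(expert_choice):
        if fill[expert] < capacity:
            buffers[expert][fill[expert]] = token_idx
            fill[expert] += 1
        else:
            # Buffer full: this token is not processed by any expert here.
            dropped.append(token_idx)
    return buffers, dropped


# Usage: 8 tokens, 4 experts, capacity factor 1.0 gives 2 slots per expert.
capacity = expert_capacity(8, 4, 1.0)
buffers, dropped = dispatch_with_capacity([0, 0, 0, 1, 2, 2, 3, 1], 4, capacity)
print(buffers)  # → [[0, 1], [3, 7], [4, 5], [6, None]]
print(dropped)  # → [2]
```

Every buffer has the same length regardless of how unevenly the router distributes tokens. The cost is padding for underused experts and dropped tokens for oversubscribed ones; in Switch Transformer-style layers, overflow tokens continue to the next layer through the residual connection.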
Key Sources
- mixture-of-depths-dynamic-compute-allocation — MoD; learned top-k token routing across transformer depth, reported to need a fraction of the FLOPs per forward pass and to step upwards of 50% faster during sampling
- switch-transformer-sparse-mixture-of-experts — Switch Transformer; MoE with top-1 routing across expert width
- mixtral-of-experts — Mixtral; MoE with top-2 routing at production scale
Related Concepts
- mixture-of-experts — the dominant form of dynamic width computation in LLMs
- inference-efficiency — dynamic computation is one of the key levers for reducing inference cost
- transformer — the architecture most dynamic computation methods extend