Dynamic Computation

What It Is

Dynamic computation refers to neural network architectures where the amount of compute spent on a given input is not fixed at design time but varies based on the input’s content. Instead of every token, image patch, or example receiving the same number of operations, the network learns to allocate more compute to “hard” inputs and less to “easy” ones — spending the total budget where it matters most.

Why It Matters

Uniform compute is a structural inefficiency: a language model applies the same 24 layers to “the” as it does to the word that changes the sentence’s meaning. Dynamic computation breaks this coupling. The total FLOP budget can remain fixed (and thus hardware-friendly), but its allocation becomes input-sensitive. The result: the same quality at lower average cost, or better quality at the same cost.

How It Works

The core challenge is making dynamic decisions compatible with hardware that demands static computation graphs and fixed tensor sizes. Approaches differ in what they make dynamic:

Dynamic depth (Mixture of Depths): A router decides per token, per layer, whether to process the token through the full block or skip it via the residual connection. The number of tokens processed per block is fixed (top-k), so tensor shapes stay static.
Dynamic width (Mixture of Experts): A router decides which expert FFN(s) to activate for each token. The number of tokens each expert processes is capped by a capacity constraint, so per-expert tensor shapes stay static as well.
Early exit: A classifier decides when to stop computing (exit at layer N instead of running all layers). The depth of computation then varies per input, which makes batching and static compilation harder.
Adaptive computation time (ACT): A learned halting mechanism decides how many computation steps to spend on each input, originally in RNNs. General in principle, but hardware-unfriendly because the step count varies per input.
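
The dynamic-depth case can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the MoD authors' implementation: the linear router, the name `mod_block`, and the toy `block_fn` are hypothetical stand-ins, and a real model would operate on batched tensors. What it shows is that the top-k selection always sends exactly k tokens through the block, while the remaining tokens pass through unchanged on the residual path.

```python
def mod_block(tokens, router_weights, block_fn, k):
    """Apply block_fn to the k highest-scoring tokens; all others skip via the residual path.

    tokens: list of token vectors (lists of floats)
    router_weights: weights of a hypothetical linear router (one scalar score per token)
    block_fn: stand-in for the transformer block; maps a vector to an update of the same size
    k: per-block capacity, fixed ahead of time
    """
    # One scalar router score per token.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]
    # Indices of the k largest scores. k is static, so block_fn always sees k tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])

    out = []
    for i, tok in enumerate(tokens):
        if i in selected:
            update = block_fn(tok)
            # Assumption: the update is scaled by the router score, which is what
            # lets the router receive gradient in a trained model.
            out.append([x + scores[i] * u for x, u in zip(tok, update)])
        else:
            out.append(list(tok))  # skipped token: identity through the residual
    return out, sorted(selected)


# Toy usage: 4 tokens, capacity k = 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
out, selected = mod_block(tokens, [1.0, 0.5], lambda v: [0.1 * x for x in v], k=2)
print(selected)  # indices of the tokens that went through the block
```

Which tokens are selected depends on the input, but the count does not, so the work done inside the block has a known size before any data arrives.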

The static-budget variants (MoD, MoE) are the most practically successful because GPU/TPU utilization depends on predictable tensor shapes. As the MoD source puts it: “Unlike other conditional computation techniques, [MoD] uses a static computation graph with known tensor sizes.”
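
For the MoE case, the static-shape idea can be sketched as top-1 routing with a per-expert capacity. This is a minimal sketch under assumptions rather than any library's API: the helper names `expert_capacity` and `route_top1_with_capacity` are hypothetical, the capacity formula follows the common "tokens per expert times a capacity factor" convention, and overflow tokens are simply dropped from the expert computation.

```python
def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Fixed number of slots per expert: (tokens / experts) * capacity factor."""
    return int(num_tokens / num_experts * capacity_factor)


def route_top1_with_capacity(logits, num_experts, capacity):
    """Assign each token to its highest-scoring expert, up to `capacity` tokens per expert.

    logits: one list of num_experts router scores per token
    Returns (slots, dropped). slots[e] always has length `capacity`, padded with None,
    so every expert sees a fixed-size input. dropped holds the overflow tokens, which
    in this sketch would bypass the expert and continue on the residual path.
    """
    slots = [[] for _ in range(num_experts)]
    dropped = []
    for t, row in enumerate(logits):
        best = max(range(num_experts), key=lambda j: row[j])  # top-1 expert
        if len(slots[best]) < capacity:
            slots[best].append(t)
        else:
            dropped.append(t)  # expert is full
    padded = [s + [None] * (capacity - len(s)) for s in slots]
    return padded, dropped


# Toy usage: 4 tokens, 2 experts, capacity factor 1.0.
logits = [[2.0, 0.1], [1.5, 0.2], [1.2, 0.3], [0.1, 0.9]]
capacity = expert_capacity(len(logits), 2, 1.0)
slots, dropped = route_top1_with_capacity(logits, 2, capacity)
print(slots, dropped)
```

The trade-off is visible in the toy input: three tokens prefer expert 0, but only two slots exist, so one token overflows. A larger capacity factor reduces dropped tokens at the cost of more padding.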

Key Sources

mixture-of-depths-dynamic-compute-allocation — MoD; learned top-k token routing across transformer depth, reported to need only a fraction of the FLOPs per forward pass and to be upwards of 50% faster to step during sampling
switch-transformer-sparse-mixture-of-experts — Switch Transformer; MoE with top-1 routing across expert width
mixtral-of-experts — Mixtral; MoE with top-2 routing at production scale

Related Concepts

mixture-of-experts — the dominant form of dynamic width computation in LLMs
inference-efficiency — dynamic computation is one of the key levers for reducing inference cost
transformer — the architecture most dynamic computation methods extend
