ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Concepts: distributed-training | data-parallel | model-parallel | memory-efficiency | mixed-precision-training

In 2019, training a 100B-parameter language model meant either model parallelism (split layers across GPUs, with all the pipeline-bubble pain that entails) or running out of memory. Data parallelism, the simpler approach, replicated everything: model weights, gradients, and optimizer states all lived in full copies on every GPU. ZeRO observes that the replication is wasteful, partitions everything, and turns data parallelism into something that scales to trillion-parameter models.

The core idea

The analogy: A construction crew of 100 workers, each carrying their own complete copy of the blueprints, the tools, and the project notebook. Each worker only ever uses one section of the blueprint at a time. Most of what they are carrying is dead weight. ZeRO’s insight: each worker carries only their assigned section, and when they need to consult someone else’s section, they call over and ask. The total weight carried by all workers shrinks roughly 100x.

Standard data-parallel training stores three things per GPU:

Model weights (parameters): say 16 bytes per parameter (fp32 master + fp16 copy).
Gradients: 4 bytes per parameter (fp16 typically, sometimes fp32).
Optimizer states (Adam: first and second moments, in fp32): about 12 bytes per parameter.

Total: roughly 16 bytes per parameter for weights + grads + a hefty 12 for optimizer = ~32 bytes per parameter, replicated on every GPU.

For a 7.5B model: 7.5B * 32 bytes = 240 GB per GPU. Even with 16-bit, that’s 90+ GB just for the static state, before activations. No GPU has that much memory.

“ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices.”

ZeRO partitions these three across the data-parallel group, in three optional stages:

ZeRO-1: partition optimizer states only. (~4x savings on optimizer state, total ~4x.)
ZeRO-2: partition optimizer states + gradients. (~8x savings on combined state.)
ZeRO-3: partition optimizer states + gradients + parameters. (~Nx savings, where N is the number of data-parallel ranks.)

What’s clever — find the instinct

The non-obvious move is the realization that most of what data-parallel training stores is unnecessary at any given moment. During the forward pass, you only need the parameters of the current layer. During the backward pass, you only need the gradients of the current layer. During the optimizer step, you only need the parameters and optimizer states of the parameters you currently own.

If a parameter “lives” on rank 17, then ranks 0..16, 18..N can fetch it from rank 17 right before the forward pass through that layer, use it, then drop it. After the backward pass, the gradient gets reduced into rank 17 (which owns it). Only rank 17 keeps the optimizer state for that parameter and runs the Adam update locally.

This is in essence: data parallelism on the outside (each GPU sees a different micro-batch), parameter sharding on the inside (each parameter has one canonical owner who runs the update). Communication-wise, ZeRO-3 is roughly equivalent to standard data-parallel: the same bytes move, just in a different order (all-gather of parameters before each forward, reduce-scatter of gradients after each backward).

“Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware.”

The second clever move: ZeRO is fully orthogonal to model parallelism (Megatron-style tensor parallel + pipeline parallel). You can compose it. Megatron handles intra-layer parallelism for very large layers; ZeRO handles state partitioning across the data dimension. The combined system (today shipped as DeepSpeed) trains models that neither approach handles alone.

Walkthrough: 7.5B model on 64 GPUs

Setup: GPT-style 7.5B parameter model. Adam optimizer.
       Mixed-precision training: fp16 forward/backward, fp32 master + Adam states.
       64-way data parallelism. No model parallelism.

Standard DDP memory per GPU:
  fp16 params:         7.5e9 * 2  =  15 GB
  fp32 master params:  7.5e9 * 4  =  30 GB
  fp16 gradients:      7.5e9 * 2  =  15 GB
  Adam m (fp32):       7.5e9 * 4  =  30 GB
  Adam v (fp32):       7.5e9 * 4  =  30 GB
  TOTAL:                          120 GB per GPU

  An A100 has 80 GB. Out of memory by 50%.

With ZeRO-1 (partition optimizer states across 64 GPUs):
  fp16 params:         15 GB
  fp32 master params:  30 / 64 = 0.47 GB
  fp16 gradients:      15 GB
  Adam m (fp32):       30 / 64 = 0.47 GB
  Adam v (fp32):       30 / 64 = 0.47 GB
  TOTAL:               31.4 GB per GPU
  Speedup: 4x reduction.

With ZeRO-2 (also partition gradients):
  fp16 params:         15 GB
  fp32 master params:  0.47 GB
  fp16 gradients:      15 / 64 = 0.23 GB
  Adam m (fp32):       0.47 GB
  Adam v (fp32):       0.47 GB
  TOTAL:               16.6 GB per GPU
  Speedup: 7x reduction.

With ZeRO-3 (also partition parameters):
  fp16 params:         15 / 64 = 0.23 GB  (live shard)
  fp32 master params:  0.47 GB
  fp16 gradients:      0.23 GB
  Adam m, v:           0.47 + 0.47 GB
  TEMPORARY all-gather buffer: ~ size of largest layer
                       ≈ 1-2 GB
  TOTAL:               ~3 GB per GPU  (steady state)
  Speedup: 40x+.

The communication cost trade-off: ZeRO-1 and ZeRO-2 add nothing over standard DDP (the all-reduce on gradients was already there, just becomes a reduce-scatter in ZeRO-2). ZeRO-3 adds an all-gather of parameters before each forward and backward, which is roughly 1.5x the communication of DDP — but the same total bytes you’d send anyway in any model-parallel approach.

Does it work? What breaks?

Headline numbers from the paper:

Configuration	Max model size	Throughput
Standard data-parallel (400 V100s)	1.4B parameters	baseline
Megatron model-parallel	8.3B parameters	(with significant code changes)
ZeRO-1 (DP only)	13B parameters	super-linear scaling to 100B+
ZeRO-2 + Megatron MP	100B+ parameters	15 PetaFLOPs on 400 V100s

The “super-linear” claim deserves explanation: as you add more GPUs, ZeRO frees memory faster than it adds workers. This lets you fit a larger micro-batch per GPU, which improves arithmetic intensity, which actually speeds up per-token throughput beyond the linear baseline. This is rare in distributed systems.

“ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism which is harder for scientists to apply.”

The simplification matters: a researcher with a working data-parallel training loop can flip ZeRO on and immediately train a 10x larger model. No code rewrite for tensor-parallel comm patterns, no debug of pipeline schedules. The Turing-NLG (17B) model was trained this way.

What breaks:

ZeRO-3 has lower throughput than ZeRO-2 due to the parameter all-gather. For models that fit with ZeRO-2, do not use ZeRO-3.
Communication-bound workloads (small batch, fast per-step compute, slow interconnect) struggle: NVLink and InfiniBand make ZeRO competitive; commodity Ethernet can be a bottleneck.
Variable-length sequences (long-context training) cause the all-gather buffers to be sized for worst case, wasting memory.
Activation memory is not addressed by ZeRO directly. ZeRO + activation checkpointing are a common pair.
Gradient accumulation interacts subtly: with ZeRO-2/3, gradients are reduce-scattered each accumulation step (or buffered), which costs communication.

So what?

For a practitioner training models that don’t fit on one GPU:

Always start with ZeRO-1. It is a free lunch: no communication overhead vs. DDP, gives 4x optimizer-state savings, takes one config flag in DeepSpeed or PyTorch FSDP.
Move to ZeRO-2 if you need more memory. Same communication cost as ZeRO-1; partitions gradients too. Gives 8x state savings.
Use ZeRO-3 only when you cannot fit a single full copy of the parameters. Its all-gather cost is real; workloads that fit with ZeRO-2 should not pay it.
Combine with model parallel for >100B models. ZeRO is the data-axis shard; Megatron handles intra-layer; pipeline handles inter-layer. The 3D combination (ZeRO + tensor + pipeline) is what trains GPT-scale models today.
Use offload sparingly. ZeRO-Infinity / CPU-offload (later ZeRO papers) move shards to CPU memory or NVMe. Useful for fitting truly huge models on small clusters; throughput drops 5-10x.
PyTorch FSDP is the modern open implementation. ZeRO-3 == FSDP with FULL_SHARD. The DeepSpeed and FSDP APIs differ but the algorithm is the same.

For the L5 interview question “how do you train a model bigger than your GPU?”: ZeRO-3 / FSDP is the right first answer. Then qualify with: combine with tensor parallelism for layers larger than per-GPU memory; combine with pipeline parallelism for interconnect-bandwidth-limited setups; combine with activation checkpointing for memory-bound forwards.

Connections

distributed-training — ZeRO is now the de facto data-axis sharding technique
data-parallel — ZeRO is data-parallel-like with partitioned state
model-parallel — composes orthogonally with tensor and pipeline parallelism
memory-efficiency — primary motivation for ZeRO
mixed-precision-training — the fp16/fp32 split is what makes optimizer states the dominant cost
microsoft-research — DeepSpeed team

Citation

arXiv:1910.02054

Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2019). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. https://arxiv.org/abs/1910.02054

ML Wiki

Explorer