The Problem
Some models do not fit on a single GPU. A 70B parameter model in fp16 needs 140 GB just for weights. Per-GPU memory caps out at 80 GB on A100 / H100. Even with mixed precision and gradient checkpointing, you cannot replicate the full model on each GPU. Data parallel does not help; you need to split the model itself.
The Key Insight
Split the model across multiple GPUs so that each GPU holds only a fraction. There are two ways to split:
- Tensor parallel (intra-layer). Within a single layer, split the matrix multiplications across GPUs. Each GPU computes part of each layer’s output.
- Pipeline parallel (inter-layer). Assign different layers to different GPUs. GPU 0 has layers 1-8, GPU 1 has layers 9-16, etc.
These are orthogonal and combinable; modern systems use both.
Mechanism in Plain English
Tensor Parallel:
- A linear layer Y = X @ W with W of shape [in, out] gets split: each GPU holds W[:, k:k+m] for some slice.
- Each GPU computes a slice of the output: Y_k = X @ W_slice.
- An all-reduce or all-gather combines outputs back into the full Y.
- Communication: every layer; bandwidth-heavy. Used within a node where NVLink is fast.
Pipeline Parallel:
- Layers are split into stages. Each stage lives on its own GPU(s).
- A micro-batch goes forward through stage 0 → stage 1 → … → stage N.
- While stage N processes micro-batch 0, stage 0 can start micro-batch 1.
- This is a literal pipeline: depth d, latency proportional to d, throughput sustained.
- Bubbles: at the start, only stage 0 is busy; at the end, only stage N. Wasted cycles.
- Communication: only at stage boundaries; per-stage size = activation size.
ASCII Diagram
Tensor Parallel (within a node, NVLink):
X (full input, replicated)
|
+--------+--------+
| | |
GPU 0 GPU 1 GPU 2
W[:,0:m] W[:,m:2m] W[:,2m:3m]
| | |
Y_0 Y_1 Y_2 (each is a slice of full Y)
| | |
+--all-reduce/all-gather--+
|
Y (full output, replicated)
Pipeline Parallel (across nodes, IB):
Time: t1 t2 t3 t4 t5 t6
GPU 0 [m0] [m1] [m2] [m3] [m4] [m5] <- micro-batches
through stage 0
GPU 1 [m0] [m1] [m2] [m3] [m4] <- through stage 1
GPU 2 [m0] [m1] [m2] [m3] <- through stage 2
GPU 3 [m0] [m1] [m2] <- through stage 3
(output)
Bubbles: the initial idle cells (e.g., GPU 1 at t1).
What’s Clever
The two model-parallel modes target different bottlenecks:
- TP attacks “single layer too large.” If a matrix multiply involves a 25K x 25K weight matrix, TP can split it across 8 GPUs giving each a manageable 25K x 3125 slice.
- PP attacks “total model too large.” 80 layers split across 8 stages = 10 layers per GPU; each GPU’s memory holds only 1/8 of the model.
Pipeline bubbles can be reduced (1F1B schedule, Megatron’s interleaved schedule, GPipe with smaller micro-batches), but never eliminated; PP always has some efficiency penalty.
TP communication scales with activation size and is per-layer; PP communication scales with activation size and is per-stage-boundary. PP needs less aggregate bandwidth.
Concrete Walkthrough
70B model on 16 GPUs:
Option A: pure DP
Per-GPU memory needed: 70B * 2 * 5 (params, grads, opt states with Adam) ≈ 700 GB.
No way it fits.
Option B: ZeRO-3 (FSDP), 16-way
Per-GPU memory: 700 / 16 ≈ 44 GB.
Plus all-gather buffer (size of largest layer): ~2 GB.
Plus activations: ~10 GB with checkpointing.
Total: ~56 GB. Fits in 80 GB.
Option C: TP=4 + DP=4
Each layer's weights split 4 ways: per-GPU param memory = 70B * 2 / 4 = 35 GB.
Then DP=4 for the four TP groups: same memory, just more replicas of the splits.
Per-GPU optimizer (fp32 Adam) = 70B * 12 / 4 = 210 GB. Too much without state sharding.
Add ZeRO-1: optimizer / 4 = 53 GB.
Per-GPU total: 35 (params, fp16) + 35 (grads) + 53 (opt) + ~10 (activations) = 133 GB. Still too much.
Option D: TP=4 + PP=2 + DP=2
Per-GPU params: 70B * 2 / (4 * 2) = 17.5 GB.
Per-GPU opt: 70B * 12 / (4 * 2) = 105 GB. With ZeRO-1 over the DP=2: 52.5 GB. Hmm.
Need to push further: TP=4, PP=4, DP=1.
Per-GPU params: 70B * 2 / (4 * 4) = 8.75 GB.
Optimizer (no DP): 70B * 12 / (4 * 4) = 52.5 GB.
Activations: ~10 GB per stage with checkpointing.
Total: 8.75 + 8.75 (grads) + 52.5 + 10 = 80 GB. Tight but fits.
Real systems trade off: usually TP=8 within a node, PP=multiple-stages across nodes, DP for replication. Plus FSDP for state sharding.
Key Sources
- megatron-lm-training-multi-billion-parameter-language-models — TP foundation paper
- zero-memory-optimizations-trillion-parameter-models — orthogonal axis (state sharding)
- pytorch-fsdp-fully-sharded-data-parallel — modern open implementation
Related Concepts
- distributed-training — broader category
- tensor-parallel — intra-layer axis
- data-parallel — orthogonal data axis
- memory-efficiency — driving constraint
Open Questions
- Optimal placement. Given a model and a cluster, what TP/PP/DP/sharding split minimizes wall-clock time? Manual today; partial automation in DeepSpeed AutoTuning, Alpa, PyTorch DTensor.
- Pipeline scheduling. 1F1B, interleaved, ZB-PP all reduce bubbles in different ways; the tradeoff space is large.
- Heterogeneous parallelism. Different layers may benefit from different splits. Some libraries (Alpa) automate this.