The Problem

Some models do not fit on a single GPU. A 70B parameter model in fp16 needs 140 GB just for weights. Per-GPU memory caps out at 80 GB on A100 / H100. Even with mixed precision and gradient checkpointing, you cannot replicate the full model on each GPU. Data parallel does not help; you need to split the model itself.

The Key Insight

Split the model across multiple GPUs so that each GPU holds only a fraction. There are two ways to split:

  1. Tensor parallel (intra-layer). Within a single layer, split the matrix multiplications across GPUs. Each GPU computes part of each layer’s output.
  2. Pipeline parallel (inter-layer). Assign different layers to different GPUs. GPU 0 has layers 1-8, GPU 1 has layers 9-16, etc.

These are orthogonal and combinable; modern systems use both.

Mechanism in Plain English

Tensor Parallel:

  • A linear layer Y = X @ W with W of shape [in, out] gets split: each GPU holds W[:, k:k+m] for some slice.
  • Each GPU computes a slice of the output: Y_k = X @ W_slice.
  • An all-reduce or all-gather combines outputs back into the full Y.
  • Communication: every layer; bandwidth-heavy. Used within a node where NVLink is fast.

Pipeline Parallel:

  • Layers are split into stages. Each stage lives on its own GPU(s).
  • A micro-batch goes forward through stage 0 → stage 1 → … → stage N.
  • While stage N processes micro-batch 0, stage 0 can start micro-batch 1.
  • This is a literal pipeline: depth d, latency proportional to d, throughput sustained.
  • Bubbles: at the start, only stage 0 is busy; at the end, only stage N. Wasted cycles.
  • Communication: only at stage boundaries; per-stage size = activation size.

ASCII Diagram

Tensor Parallel (within a node, NVLink):
                X (full input, replicated)
                |
       +--------+--------+
       |        |        |
      GPU 0   GPU 1    GPU 2
      W[:,0:m]  W[:,m:2m]  W[:,2m:3m]
       |        |        |
      Y_0     Y_1      Y_2 (each is a slice of full Y)
       |        |        |
       +--all-reduce/all-gather--+
                |
                Y (full output, replicated)


Pipeline Parallel (across nodes, IB):

Time:    t1     t2     t3     t4     t5     t6
GPU 0    [m0]   [m1]   [m2]   [m3]   [m4]   [m5]   <- micro-batches
                                                       through stage 0
GPU 1           [m0]   [m1]   [m2]   [m3]   [m4]   <- through stage 1
GPU 2                  [m0]   [m1]   [m2]   [m3]   <- through stage 2
GPU 3                         [m0]   [m1]   [m2]   <- through stage 3
                                                      (output)

Bubbles: the initial idle cells (e.g., GPU 1 at t1).

What’s Clever

The two model-parallel modes target different bottlenecks:

  • TP attacks “single layer too large.” If a matrix multiply involves a 25K x 25K weight matrix, TP can split it across 8 GPUs giving each a manageable 25K x 3125 slice.
  • PP attacks “total model too large.” 80 layers split across 8 stages = 10 layers per GPU; each GPU’s memory holds only 1/8 of the model.

Pipeline bubbles can be reduced (1F1B schedule, Megatron’s interleaved schedule, GPipe with smaller micro-batches), but never eliminated; PP always has some efficiency penalty.

TP communication scales with activation size and is per-layer; PP communication scales with activation size and is per-stage-boundary. PP needs less aggregate bandwidth.

Concrete Walkthrough

70B model on 16 GPUs:

Option A: pure DP
  Per-GPU memory needed: 70B * 2 * 5 (params, grads, opt states with Adam) ≈ 700 GB.
  No way it fits.

Option B: ZeRO-3 (FSDP), 16-way
  Per-GPU memory: 700 / 16 ≈ 44 GB.
  Plus all-gather buffer (size of largest layer): ~2 GB.
  Plus activations: ~10 GB with checkpointing.
  Total: ~56 GB. Fits in 80 GB.

Option C: TP=4 + DP=4
  Each layer's weights split 4 ways: per-GPU param memory = 70B * 2 / 4 = 35 GB.
  Then DP=4 for the four TP groups: same memory, just more replicas of the splits.
  Per-GPU optimizer (fp32 Adam) = 70B * 12 / 4 = 210 GB. Too much without state sharding.
  Add ZeRO-1: optimizer / 4 = 53 GB.
  Per-GPU total: 35 (params, fp16) + 35 (grads) + 53 (opt) + ~10 (activations) = 133 GB. Still too much.

Option D: TP=4 + PP=2 + DP=2
  Per-GPU params: 70B * 2 / (4 * 2) = 17.5 GB.
  Per-GPU opt: 70B * 12 / (4 * 2) = 105 GB. With ZeRO-1 over the DP=2: 52.5 GB. Hmm.
  Need to push further: TP=4, PP=4, DP=1.
  Per-GPU params: 70B * 2 / (4 * 4) = 8.75 GB.
  Optimizer (no DP): 70B * 12 / (4 * 4) = 52.5 GB.
  Activations: ~10 GB per stage with checkpointing.
  Total: 8.75 + 8.75 (grads) + 52.5 + 10 = 80 GB. Tight but fits.

Real systems trade off: usually TP=8 within a node, PP=multiple-stages across nodes, DP for replication. Plus FSDP for state sharding.

Key Sources

Open Questions

  • Optimal placement. Given a model and a cluster, what TP/PP/DP/sharding split minimizes wall-clock time? Manual today; partial automation in DeepSpeed AutoTuning, Alpa, PyTorch DTensor.
  • Pipeline scheduling. 1F1B, interleaved, ZB-PP all reduce bubbles in different ways; the tradeoff space is large.
  • Heterogeneous parallelism. Different layers may benefit from different splits. Some libraries (Alpa) automate this.