Model Parallel

The Problem

Some models do not fit on a single GPU. A 70B-parameter model in fp16 needs 140 GB for weights alone (70B * 2 bytes), while per-GPU memory tops out at 80 GB on A100 / H100. Mixed precision and gradient checkpointing shrink activations but not the weights, so the full model still cannot be replicated on each GPU. Plain data parallel does not help, because every replica holds the whole model; you need to split the model itself.
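The weight arithmetic above can be checked in a few lines. This is a minimal sketch; the function name and the use of decimal gigabytes (1 GB = 1e9 bytes) are assumptions made for illustration.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 70B parameters stored in fp16 (2 bytes each), as in the text.
weights_gb = weight_memory_gb(70e9, 2)

# 80 GB per GPU is the A100 / H100 figure quoted above.
gpu_capacity_gb = 80

# Lower bound on GPUs needed just to hold the weights (no grads, no optimizer).
min_gpus_for_weights = math.ceil(weights_gb / gpu_capacity_gb)

print(weights_gb, min_gpus_for_weights)
```

The bound covers inference-style weight storage only; training adds gradients, optimizer state, and activations on top.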

The Key Insight

Split the model across multiple GPUs so that each GPU holds only a fraction. There are two ways to split:

Tensor parallel (intra-layer). Within a single layer, split the matrix multiplications across GPUs. Each GPU computes part of each layer’s output.
Pipeline parallel (inter-layer). Assign different layers to different GPUs. GPU 0 has layers 1-8, GPU 1 has layers 9-16, etc.

These are orthogonal and combinable; modern systems use both.

Mechanism in Plain English

Tensor Parallel:

A linear layer Y = X @ W with W of shape [in, out] gets split: each GPU holds W[:, k:k+m] for some slice.
Each GPU computes a slice of the output: Y_k = X @ W_slice.
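An all-gather concatenates the slices back into the full Y. (If W is split by rows instead, each GPU produces a full-shape partial sum and an all-reduce adds them.)
Communication: a collective in every layer, so it is bandwidth-heavy. TP is usually kept within a node, where NVLink is fast.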
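The split-and-combine steps above can be sketched without any GPU at all. The following is a single-process illustration in plain Python, where each "GPU" is just a list entry; the helper names are hypothetical, and the collectives are simulated by concatenation (all-gather) and elementwise summation (all-reduce).

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: [n, k] @ [k, m] -> [n, m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Column shards M[:, k:k+w]; assumes the column count divides evenly."""
    width = len(m[0]) // parts
    return [[row[k * width:(k + 1) * width] for row in m] for k in range(parts)]


def split_rows(m, parts):
    """Row shards M[k:k+h, :]; assumes the row count divides evenly."""
    height = len(m) // parts
    return [m[k * height:(k + 1) * height] for k in range(parts)]


def column_parallel(x, w, parts):
    # Each simulated GPU sees the full X and one column shard of W.
    partial = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Simulated all-gather: concatenate the output slices along columns.
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


def row_parallel(x, w, parts):
    # Each simulated GPU sees a column slice of X and the matching row shard of W.
    partial = [matmul(xs, ws)
               for xs, ws in zip(split_columns(x, parts), split_rows(w, parts))]
    # Simulated all-reduce: sum the full-shape partial outputs elementwise.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 3, 0]]
y_ref = matmul(x, w)
y_col = column_parallel(x, w, parts=2)
y_row = row_parallel(x, w, parts=2)
print(y_col == y_ref, y_row == y_ref)
```

Both variants reproduce the unsplit result; they differ only in which collective stitches the pieces together.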

Pipeline Parallel:

Layers are split into stages. Each stage lives on its own GPU(s).
A micro-batch goes forward through stage 0 → stage 1 → … → stage N.
While stage N processes micro-batch 0, stage 0 can start micro-batch 1.
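This is a literal pipeline of depth d: per-micro-batch latency grows with d, but once the pipeline is full, throughput is sustained at one micro-batch per step.
Bubbles: while the pipeline fills, only stage 0 is busy; while it drains, only stage N is. The idle stages waste cycles.
Communication: only at stage boundaries; each transfer is one micro-batch's activations.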
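The fill-and-drain behaviour described above can be simulated directly. This is a forward-only sketch that assumes every stage takes exactly one time step per micro-batch; the function names are hypothetical, and real schedules also interleave backward passes.

```python
def forward_schedule(num_stages, num_microbatches):
    """grid[s][t] is the micro-batch index stage s runs at step t, or None if idle."""
    steps = num_stages + num_microbatches - 1
    grid = []
    for stage in range(num_stages):
        row = []
        for t in range(steps):
            mb = t - stage  # stage s first sees micro-batch 0 at step s
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


def idle_fraction(grid):
    """Share of (stage, step) cells in which the stage has nothing to do."""
    cells = [cell for row in grid for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


def render(grid):
    """One text line per stage, 'mK' for work and '--' for a bubble."""
    return "\n".join(
        f"GPU {s}  " + " ".join("--" if c is None else f"m{c}" for c in row)
        for s, row in enumerate(grid)
    )


grid = forward_schedule(num_stages=4, num_microbatches=6)
print(render(grid))
print(idle_fraction(grid))
```

With 4 stages and 6 micro-batches the run takes 9 steps, and 12 of the 36 cells are idle; raising the micro-batch count shrinks that share.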

ASCII Diagram

Tensor Parallel (within a node, NVLink):
                X (full input, replicated)
                |
       +--------+--------+
       |        |        |
      GPU 0   GPU 1    GPU 2
      W[:,0:m]  W[:,m:2m]  W[:,2m:3m]
       |        |        |
      Y_0     Y_1      Y_2 (each is a slice of full Y)
       |        |        |
       +-------all-gather--------+
                |
                Y (full output, replicated)


Pipeline Parallel (across nodes, InfiniBand):

Time:    t1     t2     t3     t4     t5     t6
GPU 0    [m0]   [m1]   [m2]   [m3]   [m4]   [m5]   <- micro-batches
                                                       through stage 0
GPU 1           [m0]   [m1]   [m2]   [m3]   [m4]   <- through stage 1
GPU 2                  [m0]   [m1]   [m2]   [m3]   <- through stage 2
GPU 3                         [m0]   [m1]   [m2]   <- through stage 3
                                                      (output)

Bubbles: the initial idle cells (e.g., GPU 1 at t1).

What’s Clever

The two model-parallel modes target different bottlenecks:

TP attacks “single layer too large.” If a matrix multiply involves a 25K x 25K weight matrix, TP can split it across 8 GPUs giving each a manageable 25K x 3125 slice.
PP attacks “total model too large.” 80 layers split across 8 stages = 10 layers per GPU; each GPU’s memory holds only 1/8 of the model.
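Both splits above reduce to integer division. A small sketch, assuming even divisibility and contiguous layer blocks (the helper names are made up for this example):

```python
def tp_shard_shape(rows, cols, tp):
    """Shape of one column shard of a [rows, cols] weight under tp-way TP."""
    if cols % tp != 0:
        raise ValueError("cols must be divisible by the TP degree")
    return (rows, cols // tp)


def pp_layer_ranges(num_layers, stages):
    """Contiguous layer ranges per stage, 0-indexed with exclusive end."""
    if num_layers % stages != 0:
        raise ValueError("num_layers must be divisible by the stage count")
    per_stage = num_layers // stages
    return [(s * per_stage, (s + 1) * per_stage) for s in range(stages)]


shard = tp_shard_shape(25_000, 25_000, tp=8)
ranges = pp_layer_ranges(80, stages=8)
print(shard)       # (25000, 3125)
print(ranges[0])   # (0, 10)
print(ranges[-1])  # (70, 80)
```

Real models rarely split this cleanly: embedding and output layers often make the first and last stages heavier, so production systems may assign layers unevenly.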

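Pipeline bubbles can be shrunk (more, smaller micro-batches as in GPipe; Megatron's interleaved schedule) but are rarely eliminated in synchronous training. Plain 1F1B keeps the same bubble as GPipe and mainly lowers peak activation memory. PP almost always carries some efficiency penalty.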
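How much the bubble shrinks can be estimated with the idealized formula usually quoted for Megatron-style schedules: bubble time relative to useful compute is (p - 1) / m for p stages and m micro-batches, and (p - 1) / (v * m) when each GPU holds v interleaved model chunks. The sketch below assumes uniform stage times and ignores communication; the function name is hypothetical.

```python
from fractions import Fraction


def bubble_ratio(stages, microbatches, chunks_per_gpu=1):
    """Idle time divided by useful compute time, under idealized assumptions.

    chunks_per_gpu=1 models a plain (GPipe or 1F1B) schedule;
    chunks_per_gpu>1 models an interleaved schedule.
    """
    return Fraction(stages - 1, chunks_per_gpu * microbatches)


baseline = bubble_ratio(stages=4, microbatches=6)
more_microbatches = bubble_ratio(stages=4, microbatches=24)
interleaved = bubble_ratio(stages=4, microbatches=6, chunks_per_gpu=2)

print(baseline, more_microbatches, interleaved)  # 1/2 1/8 1/4
```

Neither lever is free: more micro-batches means smaller per-step matmuls, and interleaving multiplies the number of stage-boundary transfers.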

TP communication scales with activation size and is paid in every layer; PP communication also scales with activation size but is paid only at each stage boundary. Because there are far more layers than stage boundaries, PP needs less aggregate bandwidth.
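A rough count makes the gap concrete. The sketch below tallies the logical tensor bytes exchanged in one forward pass of one micro-batch. The sequence length and hidden size are illustrative values, not from the text, and the figure of two collectives per layer is an assumption based on Megatron-style transformer blocks. Per-link traffic of a real all-reduce depends on the algorithm and is not modelled.

```python
def activation_bytes(micro_batch, seq_len, hidden, bytes_per_elem=2):
    """Size of one [micro_batch, seq_len, hidden] activation tensor in bytes."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def tp_forward_bytes(num_layers, act_bytes, collectives_per_layer=2):
    """Activation-sized collectives issued in every layer (assumed 2 per layer)."""
    return num_layers * collectives_per_layer * act_bytes


def pp_forward_bytes(num_stages, act_bytes):
    """One activation-sized send per boundary between adjacent stages."""
    return (num_stages - 1) * act_bytes


act = activation_bytes(micro_batch=1, seq_len=4096, hidden=8192)  # illustrative
tp_total = tp_forward_bytes(num_layers=80, act_bytes=act)
pp_total = pp_forward_bytes(num_stages=8, act_bytes=act)

print(act, tp_total // act, pp_total // act)
```

Under these assumptions TP moves 160 activation-sized tensors per forward pass and PP moves 7, which is why TP is kept on the fastest links.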

Concrete Walkthrough

70B model on 16 GPUs:

Option A: pure DP
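  Per-GPU memory needed: 70B * 10 bytes (fp16 params and grads plus Adam states, a rough 10 bytes/param) ≈ 700 GB. With the 16 bytes/param accounting used in Options C and D (2 + 2 + 12), it is 1,120 GB.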
  No way it fits.

Option B: ZeRO-3 (FSDP), 16-way
  Per-GPU memory: 700 / 16 ≈ 44 GB.
  Plus all-gather buffer (size of largest layer): ~2 GB.
  Plus activations: ~10 GB with checkpointing.
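  Total: ~56 GB. Fits in 80 GB at 10 bytes/param. At 16 bytes/param (as in Options C and D) the sharded state alone is 1,120 / 16 = 70 GB, which no longer fits once the buffer and activations are added.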

Option C: TP=4 + DP=4
  Each layer's weights split 4 ways: per-GPU param memory = 70B * 2 / 4 = 35 GB.
  Then DP=4 for the four TP groups: same memory, just more replicas of the splits.
  Per-GPU optimizer (fp32 Adam) = 70B * 12 / 4 = 210 GB. Too much without state sharding.
  Add ZeRO-1: optimizer / 4 = 53 GB.
  Per-GPU total: 35 (params, fp16) + 35 (grads) + 53 (opt) + ~10 (activations) = 133 GB. Still too much.

Option D: TP=4 + PP=2 + DP=2
  Per-GPU params: 70B * 2 / (4 * 2) = 17.5 GB.
  Per-GPU opt: 70B * 12 / (4 * 2) = 105 GB. With ZeRO-1 over DP=2: 52.5 GB.
  Per-GPU total: 17.5 (params) + 17.5 (grads) + 52.5 (opt) + ~10 (activations) = 97.5 GB. Still over 80 GB.
  Push further: TP=4, PP=4, DP=1.
  Per-GPU params: 70B * 2 / (4 * 4) = 8.75 GB.
  Optimizer (no DP): 70B * 12 / (4 * 4) = 52.5 GB.
  Activations: ~10 GB per stage with checkpointing.
  Total: 8.75 + 8.75 (grads) + 52.5 + 10 = 80 GB. Exactly at the 80 GB limit: it fits on paper, with no headroom for fragmentation or communication buffers.
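The per-GPU arithmetic in Options B through D can be reproduced with a small calculator. This sketch uses the walkthrough's own accounting: 2 bytes/param for fp16 weights, 2 for gradients, 12 for fp32 Adam state, and a flat ~10 GB for activations. The function names are hypothetical, and the unrounded results differ slightly from the rounded figures in the text (55.75 vs ~56, 132.5 vs 133).

```python
def zero3_per_gpu_gb(total_state_gb, num_gpus,
                     gather_buffer_gb=2.0, activations_gb=10.0):
    """Option B: all training state sharded evenly, plus buffer and activations."""
    return total_state_gb / num_gpus + gather_buffer_gb + activations_gb


def model_parallel_per_gpu_gb(params_billion, tp, pp, dp,
                              zero1=False, activations_gb=10.0):
    """Options C and D: TP and PP shard the model; ZeRO-1 shards Adam state over DP."""
    model_shards = tp * pp
    params = params_billion * 2 / model_shards   # fp16 weights
    grads = params_billion * 2 / model_shards    # fp16 gradients
    opt = params_billion * 12 / model_shards     # fp32 master copy + Adam moments
    if zero1:
        opt /= dp
    return params + grads + opt + activations_gb


option_b = zero3_per_gpu_gb(total_state_gb=700, num_gpus=16)
option_c = model_parallel_per_gpu_gb(70, tp=4, pp=1, dp=4, zero1=True)
option_d_first = model_parallel_per_gpu_gb(70, tp=4, pp=2, dp=2, zero1=True)
option_d_final = model_parallel_per_gpu_gb(70, tp=4, pp=4, dp=1)

for name, gb in [("B", option_b), ("C", option_c),
                 ("D first try", option_d_first), ("D final", option_d_final)]:
    print(f"Option {name}: {gb:.2f} GB, fits in 80 GB: {gb <= 80}")
```

Changing `activations_gb` or the byte counts shows how sensitive the "fits" verdict is to the assumptions, especially for the final configuration.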

Real systems combine all three: typically TP=8 within a node, PP with multiple stages across nodes, and DP for replication, often with ZeRO/FSDP-style state sharding on the DP axis.

Key Sources

megatron-lm-training-multi-billion-parameter-language-models — TP foundation paper
zero-memory-optimizations-trillion-parameter-models — orthogonal axis (state sharding)
pytorch-fsdp-fully-sharded-data-parallel — modern open implementation

Related Concepts

distributed-training — broader category
tensor-parallel — intra-layer axis
data-parallel — orthogonal data axis
memory-efficiency — driving constraint

Open Questions

Optimal placement. Given a model and a cluster, what TP/PP/DP/sharding split minimizes wall-clock time? Manual today; partial automation in DeepSpeed AutoTuning, Alpa, PyTorch DTensor.
Pipeline scheduling. 1F1B, interleaved 1F1B, and ZB-PP trade bubble size, activation memory, and communication against each other in different ways; the tradeoff space is large.
Heterogeneous parallelism. Different layers may benefit from different splits. Some libraries (Alpa) automate this.
