Concepts: diffusion-models | vision-transformer | scaling-laws | latent-space | patch-embeddings
Builds on: ddpm-denoising-diffusion-probabilistic-models | latent-diffusion-models-high-resolution-image-synthesis | an-image-is-worth-16x16-words
Leads to: Sora (video generation), PixArt-α, Stable Diffusion 3

By 2022, diffusion models had taken over image generation. But they all used the same backbone: the U-Net — a CNN with carefully placed skip connections between encoder and decoder layers. Transformers had displaced CNNs in image recognition and become the default in language and multimodal modeling. Nobody had seriously asked whether they could displace CNNs in generation too.

William Peebles and Saining Xie asked that question. The answer became DiT.

The core idea

The analogy: A U-Net is like a city’s road network — carefully engineered highways between specific districts (skip connections), designed for how traffic typically flows between resolutions. A Vision Transformer is a city where any district can reach any other district directly, regardless of spatial proximity. Nobody knew if direct connections would work as well as dedicated highways for image generation. They do.

DiT takes the latent diffusion pipeline — where diffusion runs in the compressed latent space of a pre-trained VAE instead of pixel space — and swaps the U-Net for a Vision Transformer. The noisy latent gets divided into small spatial patches, each patch becomes a token, and a standard transformer denoises those tokens. The U-Net’s multi-resolution feature maps, skip connections, and inductive spatial biases are all gone.

The tricky part: transformers don’t natively understand diffusion timesteps or class labels. You need to tell the model both “this is 20% noisy” and “this should be a golden retriever.” Four approaches were tested. Only one worked well.

The mechanism, step by step:

  1. Encode a 256×256 image with a pre-trained VAE → 32×32×4 latent
  2. Add Gaussian noise at timestep t → noisy latent x_t
  3. Patchify: divide the 32×32 spatial latent into p×p patches (p = 2, 4, or 8). For p = 2: 256 patches, each a 16-dim (2×2×4) vector, linearly projected to d-dim tokens
  4. Add 2D sinusoidal position embeddings
  5. Pass through transformer blocks with adaLN-Zero conditioning
  6. Final layer norm + linear projection → recover the predicted noise in latent space
  7. Optimize the same DDPM objective: MSE between predicted and actual noise
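
The patchify step (3) can be sketched in a few lines of NumPy — a toy illustration with random stand-in weights for the learned projection, using the DiT-XL/2 shapes (p = 2, d = 1152):

```python
import numpy as np

def patchify(latent, p=2, d=1152):
    """Split an (H, W, C) latent into p×p patches and project each to a d-dim token."""
    H, W, C = latent.shape
    patches = (latent.reshape(H // p, p, W // p, p, C)
                     .transpose(0, 2, 1, 3, 4)       # group the p*p*C values of each patch
                     .reshape(-1, p * p * C))        # (num_patches, p*p*C)
    rng = np.random.default_rng(0)
    W_embed = rng.standard_normal((p * p * C, d)) * 0.02  # stand-in for the learned projection
    return patches @ W_embed                          # (num_patches, d)

latent = np.zeros((32, 32, 4))    # VAE latent of a 256×256 image
tokens = patchify(latent)
print(tokens.shape)               # (256, 1152)
```

For p = 2 the 32×32 grid yields 16×16 = 256 patches of 16 values each, matching the walkthrough below.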

The four conditioning strategies tested:

1. In-context:     [token₁, ..., token₂₅₆, t_embed, c_embed]
                   Append timestep + class as extra tokens.
                   Dilutes capacity. Works, but poorly.

2. Cross-attention: Dedicated cross-attention layer per block to
                   attend to conditioning tokens.
                   Adds parameters. Mediocre results.

3. adaLN:          Regress γ, β from (t + c) → modulate LayerNorm.
                   Much better. Initialization still matters.

4. adaLN-Zero:     Same as adaLN, plus regress a gate α per block.
                   Initialize α = 0 → each block starts as identity.
                   Best results. Final model uses this.

adaLN-Zero predicts just 6 vectors per block — γ₁, β₁, α₁, γ₂, β₂, α₂ — from the conditioning signal. The gate α initializes to zero, so at the start of training every transformer block outputs zero and the residual connection passes the input unchanged. Effectively a 0-layer network that adds depth gradually as the gates grow during training.
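
The zero-gate behavior is easy to demonstrate — a minimal NumPy sketch with toy shapes and a stand-in for the attention/MLP sub-block (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.standard_normal((4, d))          # 4 tokens of width d

# Stand-in for an attention or MLP sub-block.
W = rng.standard_normal((d, d))
block = lambda x: np.tanh(x @ W)

def layernorm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# adaLN-Zero parameters (normally regressed from the conditioning MLP).
gamma, beta = np.ones(d), np.zeros(d)    # scale/shift stand-ins
alpha = np.zeros(d)                      # the gate: initialized to ZERO

out = h + alpha * block(gamma * layernorm(h) + beta)
assert np.allclose(out, h)               # at init, every block is the identity
```

Whatever the sub-block computes, multiplying it by α = 0 kills its contribution, so the residual stream passes through untouched until training moves the gates.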

The math, translated:

The training objective is unchanged from DDPM:

  L = E_{x₀, ε, t} ‖ε − ε_θ(x_t, t, c)‖²

Predict the noise ε that was added to produce x_t. MSE. The only change is that ε_θ is a transformer instead of a U-Net.
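
As code, the objective is a few lines of NumPy — a toy sketch with a zero stand-in for the transformer's prediction, not the training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 4))     # clean latent
eps = rng.standard_normal(x0.shape)       # sampled Gaussian noise
abar = 0.51                               # example signal level at some timestep t

x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps   # forward noising
eps_pred = np.zeros_like(eps)             # stand-in for eps_theta(x_t, t, c)
loss = np.mean((eps - eps_pred) ** 2)     # DDPM MSE objective
```

Note the mixture preserves unit variance: ᾱ + (1 − ᾱ) = 1, so the network sees inputs at a consistent scale across timesteps.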

Within each block, adaptive layer norm:

  adaLN(h) = γ · LayerNorm(h) + β

where γ and β are regressed by a small MLP from the sum of timestep and class embeddings. Layer norm first, then condition-dependent scale and shift.

The adaLN-Zero gate on the residual:

  h ← h + α · Block(adaLN(h))

with α initialized to zero. Each block starts as a no-op and earns its way into the computation.

Walkthrough with actual numbers:

Input image: 256×256 RGB
After VAE: 32×32×4 latent  (8× spatial compression)

Add noise at t=500 (midway through 1000 steps):
  Under a cosine noise schedule, α̅_500 ≈ 0.51, so x_500 ≈ √0.51 · x_0 + √0.49 · ε
  Roughly half signal, half noise.

Patchify (p=2, DiT-XL model):
  32×32 → 256 patches of 2×2×4 latent values
  Each patch: 16 floats → linear layer → 1152-dim token
  Sequence: 256 tokens of 1152 dims each

Timestep embedding:
  t=500 → sinusoidal frequencies → MLP → 1152-dim

Class embedding:
  class=207 (golden retriever) → embedding lookup → 1152-dim

Conditioning vector:
  c = MLP(t_embed + class_embed)  [1152-dim, same for all tokens]

Per block (28 total in DiT-XL):
  c → linear → 6 vectors of 1152-dim each: γ₁, β₁, α₁, γ₂, β₂, α₂

  Self-attention sub-block:
    h_norm = LayerNorm(h)
    h_cond = γ₁ · h_norm + β₁       (scale/shift conditioned on t,c)
    attn_out = MultiHeadAttention(h_cond)
    h = h + α₁ · attn_out            (gated residual)

  MLP sub-block:
    h_norm = LayerNorm(h)
    h_cond = γ₂ · h_norm + β₂
    mlp_out = MLP(h_cond)
    h = h + α₂ · mlp_out

After 28 blocks: 256 tokens of 1152-dim
Linear decode: 256 × (2×2×4) → 32×32×4 predicted noise
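
The final decode is just patchify run in reverse — a toy NumPy sketch with a random stand-in for the learned linear layer, shapes from the walkthrough:

```python
import numpy as np

def unpatchify(tokens, p=2, C=4, H=32, W=32):
    """Invert patchify: project tokens back to p*p*C values and reassemble the grid."""
    rng = np.random.default_rng(0)
    n, d = tokens.shape
    W_dec = rng.standard_normal((d, p * p * C)) * 0.02  # stand-in for the learned decode layer
    patches = tokens @ W_dec                            # (256, 16)
    grid = patches.reshape(H // p, W // p, p, p, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

tokens = np.zeros((256, 1152))    # output of the 28 transformer blocks
noise_pred = unpatchify(tokens)
print(noise_pred.shape)           # (32, 32, 4)
```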

What’s clever — find the instinct:

The key question Peebles and Xie had to answer was not “will a transformer work?” but “will it scale?” U-Nets have hard-coded architectural decisions — skip connection patterns, specific multi-resolution stages — with no clean “make it twice as big” operation. You can widen channels or add blocks, but there’s no smooth scaling axis.

Transformers have exactly that: scale depth (more blocks) and width (larger hidden dimension d). The language modeling literature had already shown this produces predictable improvements. The question was whether the same scaling behavior would hold for diffusion.
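
The paper's four model sizes make those axes concrete. A rough estimate of ~18·d² parameters per block (4d² attention, 8d² MLP at expansion ratio 4, 6d² for the adaLN-Zero regression; embeddings and norms ignored) lands close to the published counts — my back-of-envelope accounting, not the paper's:

```python
# (depth, hidden dim d) for the four DiT sizes reported in the paper
configs = {"DiT-S": (12, 384), "DiT-B": (12, 768),
           "DiT-L": (24, 1024), "DiT-XL": (28, 1152)}

def approx_params(depth, d):
    # 4d^2 attention (QKV + output proj) + 8d^2 MLP + 6d^2 adaLN-Zero regression
    return depth * 18 * d * d

for name, (depth, d) in configs.items():
    # paper reports 33M, 130M, 458M, 675M respectively
    print(f"{name}: ~{approx_params(depth, d) / 1e6:.0f}M params")
```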

“We find that DiTs with higher Gflops — through increased transformer depth/width or increased number of input tokens — consistently have lower FID.”

FID follows a smooth power law with GFLOPs. You can predict how good your model will be before training it. That’s what separates an architecture from a trick.

“Although it has a huge impact on Gflops, note that patch size does not have a meaningful effect on model parameter counts.”

This is the most interesting lever: halving the patch size quadruples the token count, quadruples the GFLOPs, and dramatically improves FID — without changing the number of parameters. Compute, not parameters, drives generation quality.
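
That lever can be sanity-checked with a back-of-envelope FLOPs model (attention ~ n²·d, MLP ~ 8·n·d² per block; projection terms and the embed/decode layers ignored — a rough sketch, not the paper's accounting):

```python
d, depth = 1152, 28                        # DiT-XL width and depth

def approx_gflops(p, grid=32):
    n = (grid // p) ** 2                   # token count for patch size p
    per_block = n * n * d + 8 * n * d * d  # attention + MLP multiply-adds (order of magnitude)
    return depth * per_block / 1e9

for p in (2, 4, 8):
    print(f"p={p}: {(32 // p) ** 2} tokens, ~{approx_gflops(p):.0f} GFLOPs")
```

Halving p quadruples the token count and roughly quadruples compute, while the transformer's own weights are untouched — consistent with the paper's observation.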

The adaLN-Zero initialization is a quiet detail with large implications. Setting all residual gates to zero at initialization means the early training landscape is flat and stable — no deep residual paths to cause gradient issues. The model adds “effective depth” gradually. This connects to how good residual networks have always worked: zero-initialize the last layer of each residual block.

“Simply changing the mechanism for injecting conditional inputs makes a huge difference in terms of FID.”

Among the four conditioning strategies, adaLN-Zero beats in-context conditioning by a wide margin at similar GFLOPs — the conditioning interface is as important as the architecture.

Does it actually work? What breaks?

Model           Resolution  FID-50K  GFLOPs  vs Prior Best
DiT-XL/2        256×256     2.27     119     vs LDM-4: 3.60 (−37%)
DiT-XL/2        512×512     3.04     525     vs ADM-U: 3.85 at 2813 GFLOPs
ADM-U (U-Net)   256×256     3.94     742     —
LDM-4 (U-Net)   256×256     3.60     103     —
DiT-XL/8        256×256     ~23      ~15     Same params, 8× less compute

DiT-XL/2 achieves FID 2.27 on ImageNet 256×256, beating every prior diffusion model. At 512×512, it improves FID from 3.85 (ADM-U) to 3.04 using 525 GFLOPs — versus ADM-U’s 2813 GFLOPs. Five times more compute-efficient.

The XL/8 row tells the real story: same parameter count as XL/2, but 16× fewer tokens and ~8× fewer GFLOPs, giving FID ~23 vs 2.27. Parameters aren’t the bottleneck — compute is.

Ablation: conditioning strategy

Conditioning     FID at 400K steps
adaLN-Zero       best
adaLN            2nd
Cross-attention  3rd
In-context       worst

What doesn’t work:

DiT trains for 7 million steps on ImageNet with significant compute. The paper doesn’t compare U-Net vs DiT at low compute budgets, so the crossover point is unknown. For constrained compute, a U-Net might still be competitive.

The results are class-conditional only. Text-conditional generation — the dominant use case — requires a text encoder and cross-attention conditioning. DiT doesn’t address this; PixArt-α, Stable Diffusion 3, and Sora extend the architecture with text conditioning in follow-up work.

So what?

If you’re building image or video generation systems, DiT’s message is: the architecture bottleneck has moved from “what inductive biases should I engineer” to “how much compute can I allocate.” The U-Net’s multi-resolution structure was load-bearing for years. It isn’t essential — a clean transformer with adaLN-Zero conditioning and enough tokens matches or beats it, and scales predictably as you add more.

Practical checklist: use adaLN-Zero for timestep/class conditioning. Choose patch size based on compute budget — smaller patches are better but more expensive. Budget for long training runs; the scaling behavior requires significant steps to manifest.

DiT sits at the intersection of two threads: the scaling-laws observed for language transformers, and the diffusion-models framework that had dominated image generation since DDPM. Where LDM brought diffusion into latent space, DiT brought scaling laws into diffusion. The result is a generative architecture that doesn’t plateau — it improves predictably with compute. That’s why Sora and the next generation of video models adopted the DiT backbone.

Swap U-Net for ViT, condition with adaLN-Zero, and FID scales as a power law of compute.

Connections

Citation

arXiv:2212.09748

Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023 (Oral). https://arxiv.org/abs/2212.09748