This path traces how diffusion models went from a principled but slow generative framework to the foundation of modern image synthesis. Each step addresses a core limitation of the previous one: the original formulation works but operates in pixel space at high cost; latent diffusion moves the process to a compressed space to make it practical; transformer backbones then replace the U-Net to enable further scaling.
Step 1 — DDPM: Denoising Diffusion Probabilistic Models
Start with the foundational formulation. DDPM defines a two-process generative model: a fixed forward process that gradually adds Gaussian noise to an image over T steps, and a learned reverse process that denoises step by step. The key insight is that the reverse process can be parameterized as a neural network that predicts the noise added at each step, and the resulting noise-prediction loss is a simplified reweighting of the variational lower bound, which makes it easy to optimize. DDPM produced image quality competitive with GANs without adversarial training instability, establishing diffusion as a serious generative framework.
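The forward process has a convenient closed form: x_t can be sampled directly from x_0 in one shot, which is what makes training tractable. A minimal numpy sketch, using the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the toy image and the loss helper are illustrative stand-ins, not the paper's code:

```python
import numpy as np

# Closed-form forward process q(x_t | x_0):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule from the paper
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0 at timestep t (0-indexed)."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def simple_loss(eps_pred, eps):
    """DDPM's simplified objective: MSE between true and predicted noise."""
    return np.mean((eps_pred - eps) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))     # a toy "image"
eps = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 10, eps)           # barely noised, still mostly x0
x_late = q_sample(x0, T - 1, eps)         # almost pure Gaussian noise
```

Note that the signal and noise coefficients satisfy alpha_bar + (1 - alpha_bar) = 1 at every step, so x_t keeps unit variance while the signal fraction decays smoothly to near zero.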
Step 2 — Latent Diffusion Models
DDPM operates in pixel space. At 512×512 resolution, running 1000 denoising steps on raw pixels is computationally prohibitive. Latent diffusion solves this by first training an autoencoder (a VAE regularized with either a small KL penalty or vector quantization) to compress images into a much smaller latent space, typically 8× smaller in each spatial dimension, then running the diffusion process in that latent space. Decoding the final latent back to pixels is a single cheap decoder pass. This is the architecture behind Stable Diffusion: the diffusion model never touches pixels during generation.
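The savings are easy to quantify. With Stable Diffusion's downsampling factor of 8 per side, the denoiser sees 64× fewer spatial positions per step. A back-of-envelope sketch, where the encoder is a hypothetical average-pooling stand-in rather than a real trained VAE:

```python
import numpy as np

H, W, C = 512, 512, 3                 # pixel-space image
f = 8                                 # spatial downsampling factor per side

pixel_positions = H * W               # positions a pixel-space denoiser sees
latent_positions = (H // f) * (W // f)
savings = pixel_positions // latent_positions   # 64x fewer positions per step

def encode(img):
    """Toy encoder stand-in: average-pool f x f blocks (NOT a real VAE,
    which would also map the 3 input channels to a learned channel count)."""
    h, w, c = img.shape
    return img.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

z = encode(np.ones((H, W, C)))        # shape (64, 64, 3)
# All 1000 denoising steps run on tensors shaped like `z`, never the image;
# one decoder pass at the end maps the final latent back to pixels.
```

Since the U-Net's cost per step scales with the number of spatial positions it processes, the 64× reduction applies to every one of the hundreds of denoising steps, while the encoder and decoder each run only once.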
Step 3 — DiT: Scalable Diffusion Models with Transformers
Latent diffusion uses a U-Net as the denoising backbone. DiT replaces the U-Net with a Vision Transformer: split the latent into patches that become tokens, inject conditioning via adaptive layer normalization, and run a standard transformer. The result is a model that scales predictably: more parameters and more compute yield lower FID along a smooth trend, which matters because U-Nets do not scale as cleanly. DiT-style architectures are the backbone of Sora and the next generation of video and image diffusion systems.
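Both key moves are small, mechanical operations. A minimal numpy sketch with assumed toy dimensions (patch size, embedding width, and conditioning size are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

C, H, W = 4, 32, 32          # a Stable-Diffusion-like latent
p = 2                        # patch size (DiT's best models use p = 2)
d = 64                       # token embedding width (arbitrary here)
rng = np.random.default_rng(0)

def patchify(z):
    """(C, H, W) latent -> (num_tokens, patch_dim) token sequence."""
    c, h, w = z.shape
    z = z.reshape(c, h // p, p, w // p, p)       # carve out p x p patches
    z = z.transpose(1, 3, 0, 2, 4)               # (h/p, w/p, c, p, p)
    return z.reshape((h // p) * (w // p), c * p * p)

def ada_ln(x, cond, W_mod):
    """Adaptive LayerNorm: normalize each token, then apply a scale and
    shift regressed from the conditioning vector (timestep/class embedding)."""
    x_norm = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    scale, shift = np.split(cond @ W_mod, 2)
    return x_norm * (1 + scale) + shift

tokens = patchify(rng.standard_normal((C, H, W)))   # (256, 16) sequence
x = rng.standard_normal((tokens.shape[0], d))       # embedded tokens
cond = rng.standard_normal(32)                      # conditioning embedding
W_mod = np.zeros((32, 2 * d))                       # zero-init modulation,
out = ada_ln(x, cond, W_mod)                        # echoing DiT's adaLN-Zero
```

With the modulation weights zero-initialized, scale and shift start at zero and the layer begins as a plain LayerNorm; DiT's adaLN-Zero variant uses the same idea to make each residual block start near the identity, which the paper found important for stable training at scale.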