What It Is
Diffusion models are generative models that learn to reverse a gradual noising process. They add Gaussian noise to data over many steps until it becomes pure noise, then train a neural network to denoise — step by step — back to the original. At inference, they start from random noise and iteratively refine.
Why It Matters
Diffusion models are the dominant approach for image, video, and audio generation. They typically produce more diverse samples than GANs at comparable or better quality, and they are more stable to train. DALL-E 3, Stable Diffusion, Sora, and Wan are all diffusion-based.
How It Works
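Forward process: Add Gaussian noise over T steps. At step t: x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε, where ε ~ N(0, I) and ᾱ_t is the cumulative product of α_s = 1 - β_s for s = 1..t, with β_s the noise variance added at step s. As t grows, ᾱ_t shrinks toward 0, so x_T ≈ pure noise regardless of x_0.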
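To make "x_T ≈ pure noise" concrete, the sketch below tabulates the signal and noise coefficients at a few timesteps, writing ᾱ_t for the cumulative product of (1 - β_s). The linear schedule from 1e-4 to 0.02 over T = 1000 steps is an assumption for illustration (these are the values reported in the DDPM paper), and the function name is hypothetical.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances beta_1..beta_T (assumed values)."""
    return np.linspace(beta_start, beta_end, T)

betas = linear_beta_schedule()
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

signal_scale = np.sqrt(alpha_bars)        # coefficient multiplying x_0
noise_scale = np.sqrt(1.0 - alpha_bars)   # coefficient multiplying eps

# Timesteps are 1-indexed in the text, so index t-1 in the arrays.
for t in (1, 250, 500, 1000):
    print(f"t={t:4d}  signal={signal_scale[t - 1]:.4f}  noise={noise_scale[t - 1]:.4f}")
```

Under this schedule the coefficient on x_0 is close to 1 at t = 1 and close to 0 at t = T, which is why the final sample carries almost no information about the starting image.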
Reverse process: Train a network ε_θ to predict the noise added at each step. At inference, start from x_T ~ N(0, I) and iteratively apply:
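x_{t-1} = (1/sqrt(α_t)) * (x_t - β_t/sqrt(1-ᾱ_t) * ε_θ(x_t, t)) + σ_t * z
Here α_t = 1 - β_t, ᾱ_t is the cumulative product of α_1..α_t, z ~ N(0, I) (with z = 0 on the final step, t = 1), and σ_t sets the amount of noise re-injected at each step; σ_t² = β_t is one common choice.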
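The update above can be sketched as a single NumPy function. This is a minimal illustration, not a full sampler: it assumes α_t = 1 - β_t, chooses σ_t² = β_t, and uses a zero-returning `dummy_eps_model` as a hypothetical stand-in for a trained ε_θ.

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_pred, betas, rng):
    """One ancestral sampling step x_t -> x_{t-1}; t is 1-indexed."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]

    # Mean of the reverse step: subtract the scaled noise estimate, then rescale.
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)

    if t == 1:
        return mean                      # no fresh noise on the final step
    sigma_t = np.sqrt(beta_t)            # assumed choice: sigma_t^2 = beta_t
    z = rng.standard_normal(x_t.shape)
    return mean + sigma_t * z

def dummy_eps_model(x_t, t):
    """Hypothetical placeholder for a trained noise predictor."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule
x = rng.standard_normal((4, 8))          # start from pure noise
x_prev = ddpm_reverse_step(x, 1000, dummy_eps_model(x, 1000), betas, rng)
x_last = ddpm_reverse_step(x, 1, dummy_eps_model(x, 1), betas, rng)
```

A full sampler would call this in a loop from t = T down to t = 1, feeding each output back in as the next x_t.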
Conditioning: For text-to-image/video, cross-attention layers let the denoising network attend to text embeddings, steering generation toward the prompt.
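As a rough illustration of the conditioning mechanism, here is a single-head cross-attention sketch in NumPy. Queries come from the image (or latent) tokens and keys/values come from the text embeddings. All sizes and the random projection matrices are assumptions for illustration; real models use learned weights, multiple heads, and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: queries from image features, keys/values from text."""
    Q = image_tokens @ W_q                    # (n_img, d)
    K = text_tokens @ W_k                     # (n_txt, d)
    V = text_tokens @ W_v                     # (n_txt, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_img, n_txt)
    weights = softmax(scores, axis=-1)        # each image token's weights over text tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, d = 16, 5, 32, 24, 8   # illustrative sizes only
image_tokens = rng.standard_normal((n_img, d_img))
text_tokens = rng.standard_normal((n_txt, d_txt))
W_q = rng.standard_normal((d_img, d))
W_k = rng.standard_normal((d_txt, d))
W_v = rng.standard_normal((d_txt, d))

out, weights = cross_attention(image_tokens, text_tokens, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over prompt tokens, so every spatial position can draw on different parts of the text.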
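Latent diffusion (LDM): Run diffusion in a compressed VAE latent space rather than pixel space. This is substantially cheaper in compute and memory than pixel-space diffusion at the same output resolution, because the denoiser operates on far fewer values. Used by most current large image and video models.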
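A quick back-of-the-envelope calculation shows where the savings come from. The numbers below assume a 512×512 RGB image, 8× spatial downsampling, and 4 latent channels (the configuration used by Stable Diffusion); other models use different factors.

```python
def num_values(height, width, channels):
    """Number of scalar values in a (height, width, channels) tensor."""
    return height * width * channels

downsample = 8        # assumed VAE spatial downsampling factor
latent_channels = 4   # assumed number of latent channels

pixel = num_values(512, 512, 3)
latent = num_values(512 // downsample, 512 // downsample, latent_channels)
ratio = pixel / latent

print(pixel, latent, ratio)  # 786432 16384 48.0
```

The denoiser therefore sees 48 times fewer values per sample. The actual compute saving depends on the architecture; attention layers, for example, scale quadratically with the number of spatial positions, which drops by a factor of 64 here.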
DDPM: The Canonical Formulation
The forward process adds Gaussian noise over T steps (T = 1000 in the original paper) according to a variance schedule β_1,…,β_T. The key closed-form result is that you can corrupt any image to any noise level t in one step, without simulating the chain:
q(x_t | x_0) = N(x_t; √ᾱ_t · x_0, (1-ᾱ_t)I)
where ᾱ_t = ∏_{s=1}^{t} (1-β_s). This makes training efficient: sample a random t, corrupt the image to that noise level, train the network to predict what noise was added.
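The one-shot corruption can be sketched as follows. The function name `q_sample`, the linear schedule, and the tensor shapes are assumptions for illustration; the arithmetic is the closed form q(x_t | x_0) given above.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t holds one 1-indexed timestep per batch item."""
    # Reshape alpha_bar_t to (batch, 1, 1, ...) so it broadcasts over the data dimensions.
    a_bar = alpha_bars[t - 1].reshape(-1, *([1] * (x0.ndim - 1)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

x0 = rng.standard_normal((4, 3, 8, 8))    # stand-in batch of 4 tiny "images"
t = rng.integers(1, 1001, size=4)         # a random timestep in 1..1000 per item
x_t, eps = q_sample(x0, t, alpha_bars, rng)
```

Returning `eps` alongside `x_t` is convenient because the sampled noise is exactly the regression target during training.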
For the reverse process, a neural network ε_θ (a U-Net in the original paper) is trained to predict the noise ε that was added at each timestep. The training objective simplifies to:
L = E[||ε - ε_θ(x_t, t)||²]
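A single-batch estimate of this objective might look like the sketch below. It only computes the loss value; gradient computation and the optimizer are left out, and `zero_model` is a hypothetical placeholder for ε_θ. The schedule and shapes are assumptions for illustration.

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bars, rng):
    """Monte Carlo estimate of E[||eps - eps_theta(x_t, t)||^2] on one batch.

    Uses the per-element mean, which differs from the squared norm only by a
    constant factor (the number of elements per sample).
    """
    batch = x0.shape[0]
    t = rng.integers(1, len(alpha_bars) + 1, size=batch)       # random timestep per item
    a_bar = alpha_bars[t - 1].reshape(batch, *([1] * (x0.ndim - 1)))
    eps = rng.standard_normal(x0.shape)                        # the noise to be predicted
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps     # closed-form corruption
    eps_pred = eps_model(x_t, t)
    return np.mean((eps - eps_pred) ** 2)

def zero_model(x_t, t):
    """Hypothetical untrained predictor that always outputs zeros."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.standard_normal((64, 16))
loss = ddpm_loss(zero_model, x0, alpha_bars, rng)
```

With the all-zeros predictor the loss is roughly 1, the variance of the unit Gaussian noise; a trained network should drive it below that baseline.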
Plain mean squared error on noise prediction. This ε-parameterization has an elegant interpretation: up to a known scale factor, the network is learning the score function, that is, the gradient of the log-probability density of the noised data, since ∇_{x_t} log q(x_t) ≈ -ε_θ(x_t, t) / √(1-ᾱ_t). This links diffusion to score matching and Langevin dynamics.
Connection to score matching: the denoising objective is equivalent to estimating the score of the noised data density at each noise level. This gives diffusion models a theoretical footing that is harder to obtain for GANs: they can be analyzed as discretizations of stochastic differential equations with well-understood mathematical properties.
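The link between noise and score can be checked numerically for the conditional distribution q(x_t | x_0), whose score is available in closed form. The sketch below is a sanity check under assumed values (an arbitrary ᾱ_t = 0.3 and random vectors), not a derivation; the function names are hypothetical.

```python
import numpy as np

def score_from_eps(eps, alpha_bar_t):
    """Convert a noise value (or prediction) into a score: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

def conditional_score(x_t, x0, alpha_bar_t):
    """Gradient w.r.t. x_t of log N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I)."""
    return -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3                          # arbitrary noise level for the check
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# The rescaled true noise equals the score of the Gaussian that produced x_t.
assert np.allclose(score_from_eps(eps, alpha_bar_t),
                   conditional_score(x_t, x0, alpha_bar_t))
```

Because the two quantities differ only by the factor -1/√(1-ᾱ_t), a network trained to predict ε can be reused as a score estimate by applying the same rescaling to its output.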
Key Sources
- ddpm-denoising-diffusion-probabilistic-models — the DDPM paper (Ho et al., 2020) that established the canonical formulation
- latent-diffusion-models-high-resolution-image-synthesis — LDM / Stable Diffusion: running diffusion in VAE latent space
- dit-scalable-diffusion-models-with-transformers — DiT: replacing the U-Net with a ViT-style transformer, showing that FID improves consistently as model compute increases
- numina-counting-text-to-video — training-free fix for counting errors in DiT-based T2V diffusion
Related Concepts
Open Questions
- Sampling efficiency: generation requires many sequential denoising steps (though distillation and consistency models reduce this)
- Compositional control: hard to reliably generate specific counts, relations, and spatial layouts
- Video: temporal consistency across frames remains an active challenge