What It Is

Diffusion models are generative models that learn to reverse a gradual noising process. They add Gaussian noise to data over many steps until it becomes pure noise, then train a neural network to denoise — step by step — back to the original. At inference, they start from random noise and iteratively refine.

Why It Matters

Diffusion models are the dominant approach for image, video, and audio generation. They produce higher-quality, more diverse samples than GANs and are more stable to train. DALL-E 3, Stable Diffusion, Sora, and Wan are all diffusion-based.

How It Works

Forward process: Add Gaussian noise over T steps. With a noise schedule β_t, define α_t = 1 - β_t and ᾱ_t = ∏_{s=1}^{t} α_s. Then x_t can be sampled in closed form directly from x_0: x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε, where ε ~ N(0, I). After enough steps, x_T ≈ pure noise regardless of x_0.
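
The closed-form jump can be sketched in a few lines. The linear β schedule below (1e-4 to 0.02 over T = 1000 steps) is a common illustrative choice, not mandated by the method:

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product abar_t = prod_{s=1..t} (1 - beta_s), linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def forward_noise(x0, t, T=1000):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps in one shot (scalar toy)."""
    ab = alpha_bar(t, T)
    eps = random.gauss(0.0, 1.0)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

print(alpha_bar(1))     # near 1: x_1 is barely perturbed
print(alpha_bar(1000))  # near 0: x_T is essentially pure noise
```

Because ᾱ_t shrinks toward zero, the signal coefficient sqrt(ᾱ_t) vanishes at large t, which is exactly the "x_T ≈ pure noise" claim above.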

Reverse process: Train a network ε_θ to predict the noise added at each step. At inference, start from x_T ~ N(0, I) and iteratively apply: x_{t-1} = (1/sqrt(α_t)) * (x_t - β_t/sqrt(1-ᾱ_t) * ε_θ(x_t, t)) + σ_t * z, where z ~ N(0, I) is fresh noise (omitted at the final step) and σ_t controls sampling stochasticity (a common choice is σ_t² = β_t).
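
A scalar sketch of that sampling loop, assuming the same linear β schedule as above. `eps_theta` here is a hypothetical stand-in for the trained network (it returns 0.0, so the loop just demonstrates the update rule, not real generation):

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # beta_1..beta_T
alphas = [1.0 - b for b in betas]
alpha_bars = []
_prod = 1.0
for a in alphas:
    _prod *= a
    alpha_bars.append(_prod)

def eps_theta(x_t, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return 0.0  # a real model returns its estimate of the noise present in x_t

def sample():
    x = random.gauss(0.0, 1.0)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        beta, alpha, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        z = random.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise at the last step
        mean = (x - beta / math.sqrt(1.0 - ab) * eps_theta(x, t)) / math.sqrt(alpha)
        x = mean + math.sqrt(beta) * z                # sigma_t^2 = beta_t choice
    return x

print(sample())
```

Each iteration is a direct transcription of the update formula: subtract the (scaled) predicted noise, rescale by 1/sqrt(α_t), then add σ_t * z.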

Conditioning: For text-to-image/video, cross-attention layers let the denoising network attend to text embeddings, steering generation toward the prompt.
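
The attention pattern itself is ordinary scaled dot-product attention with queries from image tokens and keys/values from text tokens. A minimal single-head sketch (projection weight matrices omitted for brevity; real layers learn Q/K/V projections):

```python
import math

def cross_attention(img_tokens, txt_tokens):
    """Each image token (query) attends over text tokens (keys = values here).
    Returns one mixed text vector per image token."""
    d = len(txt_tokens[0])
    out = []
    for q in img_tokens:
        # scaled dot-product scores against every text token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in txt_tokens]
        # numerically stable softmax
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # convex combination of the text (value) vectors
        out.append([sum(a * v[j] for a, v in zip(attn, txt_tokens))
                    for j in range(d)])
    return out

# Two "image" tokens attend over three toy 2-D "text embedding" tokens.
print(cross_attention([[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```

In a real denoiser these layers are interleaved with the convolutional or transformer blocks, so every spatial location can pull in prompt information at every denoising step.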

Latent diffusion (LDM): Run diffusion in a compressed VAE latent space rather than pixel space. Orders of magnitude cheaper. Used by almost all modern models.
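
The saving is easy to quantify. Assuming the Stable Diffusion configuration (8× spatial downsampling, 4 latent channels) as a representative example:

```python
# Elements the denoiser touches per step: pixel space vs. VAE latent space.
# 512x512 RGB image -> 64x64x4 latent under an 8x-downsampling VAE.
pixel_elems = 512 * 512 * 3    # 786,432 values
latent_elems = 64 * 64 * 4     # 16,384 values
print(pixel_elems // latent_elems)  # -> 48, i.e. ~48x fewer values per step
```

Since the denoising network runs many times per sample, this per-step reduction compounds into the "orders of magnitude" cost difference.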

Key Sources

Open Questions

  • Sample efficiency: diffusion requires many denoising steps (though distillation/consistency models reduce this)
  • Compositional control: hard to reliably generate specific counts, relations, and spatial layouts
  • Video: temporal consistency across frames remains an active challenge