You want to generate a photorealistic image from scratch. GANs were the dominant answer for years — train a generator and discriminator in adversarial competition, hope they converge, fight mode collapse, watch training blow up. The results when it worked were stunning. The reliability was terrible. In 2020, Ho et al. proposed a different answer: don’t generate images directly. Instead, learn to remove noise, step by step. The result was DDPM — a model that started a paradigm shift and became the foundation of Stable Diffusion, DALL-E 2, Imagen, and virtually every image generation system that followed.
The core idea
The analogy: Imagine you have a perfect photo of a dog. You add a tiny bit of random static to it — barely perceptible. Then you add a bit more. Then more. After 1,000 steps of adding Gaussian noise, you have pure static — completely random pixels with no dog visible. This is the forward process, and it’s trivial to simulate mathematically.
The key insight: if you train a neural network to reverse just one step — to take a slightly-noisy image and predict a slightly-less-noisy version — you can chain 1,000 of those steps together to generate an image from pure noise. You never have to teach the network to generate a dog from nothing. You just teach it to remove a little noise. That’s a much easier problem.
DDPM makes this precise. The forward process adds Gaussian noise according to a fixed schedule:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

At each step $t$, you scale down the signal slightly (by $\sqrt{1-\beta_t}$) and add noise with variance $\beta_t$. After $T$ steps (the paper uses $T = 1000$) with a carefully chosen schedule of $\beta_t$ values, $x_T$ is approximately standard Gaussian noise — all signal destroyed.
The crucial trick: you don’t have to step through 1,000 forward steps one at a time. There’s a closed form for any step $x_t$ directly from $x_0$:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$. This means during training you can corrupt any image to any noise level instantly, without simulating the entire chain.
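The closed-form shortcut is easy to see in code. Below is a minimal sketch (not the authors' code) assuming PyTorch and the paper's linear schedule, $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps; the tensor shapes are illustrative:

```python
import torch

# Linear beta schedule from the DDPM paper: 1e-4 -> 0.02 over T = 1000 steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Corrupt x0 to noise level t in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

x0 = torch.randn(3, 32, 32)       # stand-in for a clean image
eps = torch.randn_like(x0)
x_early = q_sample(x0, 10, eps)   # barely noisy: alpha_bar still close to 1
x_late = q_sample(x0, 999, eps)   # nearly pure noise: alpha_bar close to 0
```

Note that `alpha_bars` decays toward zero as $t \to T$, which is exactly the "all signal destroyed" property the schedule is designed for.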
The mechanism, step by step
Training:
- Sample a clean image $x_0$ from your dataset.
- Sample a random timestep $t$ uniformly from $\{1, \dots, T\}$.
- Sample Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$.
- Construct the noisy version: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ (this uses the closed-form shortcut).
- Feed $x_t$ and $t$ into a U-Net neural network.
- The network predicts $\epsilon_\theta(x_t, t)$ — its guess at what noise was added.
- Loss: $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ — simple mean squared error on the noise.
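The training steps above can be sketched in a few lines of PyTorch. This is a hypothetical minimal version: the single conv layer stands in for the real U-Net (which would also take $t$ as input), and the optimizer and batch are illustrative:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for eps_theta(x_t, t); a real U-Net would also condition on t.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x0):
    t = torch.randint(0, T, (x0.shape[0],))           # random timestep per image
    eps = torch.randn_like(x0)                        # the true noise
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # closed-form corruption
    eps_hat = model(x_t)                              # network's guess at the noise
    loss = ((eps - eps_hat) ** 2).mean()              # simple MSE on the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = train_step(torch.randn(4, 3, 32, 32))  # fake batch of four "images"
```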
Inference (image generation):
- Start with $x_T \sim \mathcal{N}(0, I)$ — pure random noise.
- For $t = T, \dots, 1$:
  - Ask the network: “what noise was in this image at step $t$?”
  - Use that prediction to compute $x_{t-1}$ (the slightly less noisy version).
  - Add a small amount of fresh noise (unless $t = 1$) to maintain stochasticity.
- $x_0$ is your generated image.
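The generation loop corresponds to ancestral sampling (Algorithm 2 in the paper). A self-contained sketch follows, with a dummy noise predictor standing in for the trained U-Net and the paper's simple choice $\sigma_t^2 = \beta_t$ for the fresh-noise variance; everything else (schedule, shapes) is illustrative:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def predict_eps(x_t, t):
    return torch.zeros_like(x_t)  # dummy: a trained U-Net goes here

@torch.no_grad()
def sample(shape):
    x = torch.randn(shape)  # x_T: pure noise
    for t in reversed(range(T)):
        eps_hat = predict_eps(x, t)
        # Posterior mean: subtract the predicted noise contribution, then rescale.
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # fresh noise keeps sampling stochastic
        else:
            x = mean  # final step: no added noise
    return x

img = sample((1, 3, 8, 8))  # tiny spatial size just so the loop runs quickly
```

With a real trained network in place of `predict_eps`, this loop walks from pure static to an image in $T$ denoising steps.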
```
TRAINING:
  clean image x_0 ──[add noise at random level t]──> noisy x_t
                                                        |
                                          U-Net predicts noise eps_hat
                                                        |
                                        loss = ||eps_true - eps_hat||^2

INFERENCE:
  pure noise x_T
       |
  [U-Net removes a bit of noise] ──> x_{T-1}
       |
  [U-Net removes a bit of noise] ──> x_{T-2}
       |
  ... (1000 steps)
       |
  x_0: generated image
```
Why predict the noise and not the image directly?
The paper tried both. Predicting the noise (the “$\epsilon$-parameterization”) worked better empirically and has an elegant interpretation: the network is learning (a scaled version of) the score function — the gradient of the log-probability density with respect to the data. This connection to score matching and Langevin dynamics is what the abstract’s framing in terms of “nonequilibrium thermodynamics” points at: the forward diffusion process is a discretized stochastic differential equation, and the reverse process is another SDE that can be solved numerically.
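The scale relationship between the noise and the score follows directly from the closed form. Since $x_t \mid x_0$ is Gaussian with mean $\sqrt{\bar\alpha_t}\,x_0$ and variance $(1-\bar\alpha_t)I$:

$$\nabla_{x_t} \log q(x_t \mid x_0) \;=\; -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} \;=\; -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}$$

so a network trained to predict $\epsilon$ is, up to the factor $-1/\sqrt{1-\bar\alpha_t}$, predicting the score at noise level $t$.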
Find the instinct
Why does this work at all?
The deep insight is that noise removal at different scales corresponds to learning different aspects of image structure. When you remove noise from a nearly-pure-noise image (high t), you’re learning global structure — is this an outdoor scene or a face? When you remove noise from a nearly-clean image (low t), you’re filling in fine details — where exactly is the pupil?
By training on all noise levels simultaneously, the model learns a full hierarchical understanding of image structure. It’s like having specialized experts, each cleaning up a slightly different level of corruption, all sharing weights through the U-Net backbone.
The connection to GANs: a GAN’s discriminator effectively learns “does this look like a real image?” The diffusion model’s U-Net instead answers “what noise was added to this image?” That second question turns out to be easier to train on because the answer is a concrete regression target (the noise tensor), not a binary judgment, and the model gets dense gradient signal at every training step rather than depending on a fragile adversarial equilibrium.
In the paper’s framing, diffusion models are equivalent to multi-scale denoising autoencoders, a connection that recasts diffusion as hierarchical feature learning.
Architecture: the U-Net
The paper uses a U-Net — a convolutional architecture originally designed for medical image segmentation. It has:
- An encoder that progressively downsamples the image (compressing spatial resolution, increasing channels)
- A decoder that upsamples back to original resolution
- Skip connections between corresponding encoder and decoder layers
The timestep t is injected into every residual block as an embedding (similar to positional encoding in Transformers). This tells the network what noise level it’s working at. Later work added cross-attention for conditioning on text prompts — that’s how text-to-image generation works.
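The timestep embedding is typically the same sinusoidal construction as the Transformer positional encoding. A minimal sketch, assuming PyTorch; the dimension and frequency base are illustrative choices, not values fixed by the paper:

```python
import math
import torch

def timestep_embedding(t, dim=128, max_period=10000):
    """Map integer timesteps t (shape [B]) to sinusoidal embeddings of shape [B, dim]."""
    half = dim // 2
    # Geometrically spaced frequencies, as in Transformer positional encoding.
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 500, 999]))
```

Each residual block in the U-Net would then project this embedding (e.g. through a small MLP) and add it to its feature maps, so every layer knows the current noise level.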
Results
On CIFAR-10 at the time of publication:
- FID score of 3.17 (lower is better; this was state-of-the-art)
- Inception score of 9.46
On 256×256 LSUN (bedroom and church scenes): sample quality comparable to ProgressiveGAN, with more diversity. GANs tend to produce sharp images but lack diversity (mode collapse). Diffusion models get both sharpness and diversity because the stochastic sampling process naturally explores the full distribution.
The limitations:
- Speed: generating one image requires 1,000 sequential forward passes through the U-Net; a GAN needs one. This ~1,000× slowdown made diffusion impractical for real-time applications at first.
- Pixel-space diffusion: DDPM runs diffusion directly on pixels, which is expensive at high resolution. Stable Diffusion later fixed this by running diffusion in the compressed latent space of a VAE — 8× smaller in each spatial dimension, dramatically faster.
- No text conditioning: the original DDPM is unconditional. Classifier guidance and CLIP-based conditioning came later to enable text-to-image.
So what?
DDPM replaced GANs as the dominant image generation paradigm within two years. Every major text-to-image model — Stable Diffusion (which uses latent diffusion), DALL-E 2, Imagen, Midjourney — is built on diffusion principles. The key improvements that followed: DDIM (faster sampling via deterministic trajectories, reducing steps from 1000 to 50), classifier-free guidance (conditioning on prompts without a separate classifier), and latent diffusion (running in compressed space).
If you’re working with image generation, understanding DDPM is mandatory. It also generalizes: diffusion models now work on audio (WaveGrad), video, protein structures (AlphaFold 3), and molecular design.
Connections
- diffusion-models — the technique introduced in this paper
- transformer — later U-Net architectures replaced convolutions with attention, leading to DiT (Diffusion Transformer)
- clip-learning-transferable-visual-models — CLIP embeddings power text conditioning in DALL-E 2 and Stable Diffusion
- attention-is-all-you-need — attention mechanisms are incorporated into diffusion U-Nets for cross-attention to text
Citation
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. https://arxiv.org/abs/2006.11239