Concepts: diffusion-models | latent-space | vae | cross-attention Builds on: attention-is-all-you-need Leads to: dit-scalable-diffusion-models-with-transformers
In 2021, training a state-of-the-art diffusion model required hundreds of GPU days. Not because the math was hard — because the images are enormous. A 512×512 image has 786,432 pixel values, and diffusion models run a neural network over all of them at every one of thousands of denoising steps. The compute was catastrophic enough that only large research labs could run these experiments. Rombach et al. asked a simple question: does diffusion actually need to happen in pixel space? The answer was no — and the result was Latent Diffusion Models (LDMs), the architecture that became Stable Diffusion.
The core idea
The analogy: imagine you’re an artist painting a large portrait. One approach is to work directly on the final high-resolution canvas from the start, where every brushstroke is expensive and every mistake requires re-painting. The second approach: sketch in a small notebook first. The sketch captures everything that matters — composition, proportions, the subject’s expression. When the sketch looks right, you transfer it to the large canvas and add detail. The creative work happened in the compact sketch. The canvas step is just rendering.
LDM is the second approach. The iterative refinement — the diffusion process — happens in a compact latent space, not in pixel space. Once the latent sketch is finalized, a single decoder pass renders it to pixels.
Stage 1: Build the compression codec (train once, freeze)
Train a VQ-regularized or KL-regularized autoencoder:
- Encoder $\mathcal{E}$: compresses image $x \in \mathbb{R}^{H \times W \times 3}$ to latent $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$
- Decoder $\mathcal{D}$: reconstructs $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$
- Downsampling factor $f = H/h = W/w$ (tested across $f \in \{1, 2, 4, 8, 16, 32\}$)
- Training loss: perceptual loss + patch-based adversarial objective (not plain MSE — this forces the decoder to learn textures, not blurry averages)
This stage performs perceptual compression: strip out details below the threshold of human perception while preserving semantically meaningful structure.
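The compression arithmetic behind Stage 1 is easy to sanity-check. A minimal sketch of the shape bookkeeping (the helpers `latent_shape` and `compression_ratio` are hypothetical names, not the paper's API; defaults match LDM-8 with 4 latent channels):

```python
def latent_shape(h, w, f=8, c_latent=4):
    """Latent shape for an h x w RGB image with downsampling factor f.
    Hypothetical helper for illustration, not from the LDM codebase."""
    assert h % f == 0 and w % f == 0, "image dims must be divisible by f"
    return (h // f, w // f, c_latent)

def compression_ratio(h, w, f=8, c_latent=4):
    # how many times fewer dimensions the diffusion model sees per step
    pixel_dims = h * w * 3
    zh, zw, zc = latent_shape(h, w, f, c_latent)
    return pixel_dims / (zh * zw * zc)

print(latent_shape(512, 512))       # (64, 64, 4)
print(compression_ratio(512, 512))  # 48.0
```
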
Stage 2: Diffuse in latent space
Standard DDPM forward process, applied to $z = \mathcal{E}(x)$ instead of $x$:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\; \sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\big)$$

Train a U-Net $\epsilon_\theta$ to predict the noise at each timestep:

$$L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t) \big\|_2^2\Big]$$

Same objective as pixel-space DDPM — just applied to the compact latent $z_t$ instead of raw pixels $x_t$.
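The objective can be sketched in a few lines of NumPy, assuming the standard DDPM linear schedule ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$); `q_sample` and `ldm_loss` are illustrative names, and the random latent stands in for an encoder output:

```python
import numpy as np

# assumed DDPM linear schedule: beta_1 = 1e-4, beta_T = 0.02 over T = 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    # closed-form forward process on the latent:
    # z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alpha_bar[t - 1]) * z0 + np.sqrt(1 - alpha_bar[t - 1]) * eps

def ldm_loss(eps_pred, eps):
    # the DDPM objective, unchanged, just on latent-shaped tensors
    return np.mean((eps - eps_pred) ** 2)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((32, 32, 4))   # latent from the frozen encoder
eps = rng.standard_normal(z0.shape)
zt = q_sample(z0, 500, eps)
print(ldm_loss(eps, eps))  # 0.0 for a perfect noise predictor
```
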
Stage 3: Conditioning via cross-attention
To condition on text, class labels, bounding boxes, or semantic maps, a domain-specific encoder $\tau_\theta$ (BERT for text, ResNet for images, etc.) maps the conditioning input $y$ to a sequence of embeddings $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$. The U-Net backbone receives them via cross-attention layers inserted at each resolution:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V, \qquad Q = W_Q \cdot \varphi_i(z_t), \;\; K = W_K \cdot \tau_\theta(y), \;\; V = W_V \cdot \tau_\theta(y)$$

The conditioned training objective becomes:

$$L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \big\|_2^2\Big]$$

The U-Net's spatial features $\varphi_i(z_t)$ act as queries; the conditioning tokens act as keys and values. This is the cross-attention mechanism from "Attention Is All You Need" — unchanged.
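A minimal NumPy sketch of this cross-attention step. The dimensions (320-d U-Net features, 768-d text embeddings, 77 tokens) echo common Stable Diffusion conventions but are illustrative here, and the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(phi_z, tau_y, Wq, Wk, Wv):
    """phi_z: (N, d_model) flattened U-Net spatial features (queries).
    tau_y: (M, d_cond) conditioning token embeddings (keys/values)."""
    Q, K, V = phi_z @ Wq, tau_y @ Wk, tau_y @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (N, M): each spatial position
    return softmax(scores) @ V               # attends over conditioning tokens

rng = np.random.default_rng(0)
d_model, d_cond, d = 320, 768, 64
phi_z = rng.standard_normal((64 * 64, d_model))  # flattened 64x64 latent grid
tau_y = rng.standard_normal((77, d_cond))        # e.g. 77 text tokens
Wq = rng.standard_normal((d_model, d)) * 0.02
Wk = rng.standard_normal((d_cond, d)) * 0.02
Wv = rng.standard_normal((d_cond, d)) * 0.02
out = cross_attention(phi_z, tau_y, Wq, Wk, Wv)
print(out.shape)  # (4096, 64)
```

Because the queries come from the latent grid and the keys/values come from the conditioning sequence, swapping the conditioning encoder changes nothing about this layer's shapes or code.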
PIXEL-SPACE DIFFUSION (expensive):
x_T (512×512×3) ───────────────────────── x_0 (512×512×3)
1000 denoising steps
786K dims per U-Net call
LATENT DIFFUSION (LDM):
STAGE 1 — Perceptual compression (once, frozen):
x (512×512×3) ──[Encoder E, f=8]──► z (64×64×4) = 16K dims (48× smaller)
z ──[Decoder D] ──► x̃ (512×512×3)
STAGE 2 — Diffusion in latent space (all iterations here):
z_T (64×64×4) ─────────────────────────── z_0 (64×64×4)
1000 denoising steps ↑
16K dims per step [cross-attn: text/class injected here]
STAGE 3 — Single decode (one forward pass, fast):
z_0 ──[Decoder D]──► x̃ (512×512×3)
DIMENSION COMPARISON (512×512 input):
Pixel space (f=1): 786,432 dimensions per step
LDM f=4: 49,152 dimensions per step (16× fewer)
LDM f=8: 16,384 dimensions per step (48× fewer)
LDM f=16: 4,096 dimensions per step (192× fewer — too lossy)
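The three stages above can be traced as a shape-level sketch. `denoise_step` and `decoder` are trivial stubs standing in for the trained U-Net and frozen decoder, so this shows the data flow (all 1000 iterations in 16K dimensions, one decode to 786K), not real generation:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(z_t, t):
    # stub: a real LDM would run the U-Net here,
    # eps = eps_theta(z_t, t, tau(y)), then apply the DDPM update
    return 0.99 * z_t

def decoder(z):
    # stub: a real LDM runs the frozen decoder D, upsampling
    # the 64x64x4 latent back to a 512x512x3 image
    return np.zeros((512, 512, 3))

z = rng.standard_normal((64, 64, 4))  # z_T: start from latent-space noise
for t in range(1000, 0, -1):          # every denoising step stays at 16K dims
    z = denoise_step(z, t)
x = decoder(z)                        # single decode back to pixel space
print(z.size, x.size)  # 16384 786432
```
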
Numerical walkthrough:

Take a 256×256×3 image processed by LDM-8 ($f=8$, latent channels $c=4$):

- Encode: $z = \mathcal{E}(x)$, shape $32 \times 32 \times 4 = 4{,}096$ dimensions (vs 196,608 for pixel space — 48× smaller)
- Add noise at $t=500$: with linear noise schedule $\beta_1 = 10^{-4}$, $\beta_T = 0.02$: $\bar\alpha_{500} \approx 0.079$, so $z_{500} = \sqrt{\bar\alpha_{500}}\, z_0 + \sqrt{1 - \bar\alpha_{500}}\, \epsilon \approx 0.28\, z_0 + 0.96\, \epsilon$. For one latent dimension with, say, $z_0 = 1.0$ and $\epsilon = 0.5$: $z_{500} \approx 0.28 + 0.48 = 0.76$
- U-Net predicts noise: $\hat\epsilon = \epsilon_\theta(z_{500}, 500)$. Say it predicts $\hat\epsilon = 0.48$ (close to true $\epsilon = 0.5$)
- DDPM denoising step (simplified, ignoring stochastic term): $z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(z_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \hat\epsilon\Big)$. Each step shaves a tiny slice of predicted noise. After 1000 steps, $z_T$ has been refined into $z_0$
- Decode once: $\tilde{x} = \mathcal{D}(z_0)$ — one pass through the decoder to render the full 256×256×3 image
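This single-dimension arithmetic can be checked numerically, assuming the standard DDPM linear schedule; $z_0 = 1.0$, $\epsilon = 0.5$, and $\hat\epsilon = 0.48$ are illustrative values, not measured ones:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed DDPM linear schedule
abar = np.cumprod(1.0 - betas)

t = 500
z0, eps = 1.0, 0.5                     # one illustrative latent dimension
zt = np.sqrt(abar[t - 1]) * z0 + np.sqrt(1 - abar[t - 1]) * eps
print(round(zt, 2))  # ~0.76

eps_hat = 0.48                         # imperfect U-Net noise prediction
# one DDPM update, stochastic term dropped
alpha_t, beta_t = 1 - betas[t - 1], betas[t - 1]
z_prev = (zt - beta_t / np.sqrt(1 - abar[t - 1]) * eps_hat) / np.sqrt(alpha_t)
```

Each update removes only a sliver of the predicted noise, which is why the full chain takes on the order of a thousand steps.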
What’s clever — find the instinct:
Earlier work tried running diffusion in compressed spaces using simple downsampling or PCA. It didn’t work well because naive compression destroys perceptual quality in ways that matter. The LDM insight was using a learned, perceptually-trained autoencoder — one whose decoder is trained with adversarial and perceptual losses to hallucinate convincing textures. The diffusion model doesn’t need to generate fine-grained pixel textures. It just needs to get the semantic structure right in a compact space, and the decoder fills in the rest.
“training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation”
The second non-obvious move was conditioning via cross-attention. Earlier conditional diffusion models used simple class-label embeddings added to the noise predictor input — a one-size-fits-all approach that scales poorly. Cross-attention lets the model attend to any sequence of conditioning tokens — text descriptions, bounding box coordinates, depth maps, semantic segmentation labels — using the same architecture with no changes. Swap the conditioning encoder $\tau_\theta$, and you get a completely different type of guided generation.
“By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes.”
“high-resolution synthesis becomes possible in a convolutional manner” — the spatial structure of the latent grid allows efficient convolutional processing at each scale.
Does it actually work? What breaks?
| Task | LDM config | FID↓ | Baseline | Notes |
|---|---|---|---|---|
| CelebA-HQ 256×256 (uncond.) | LDM-4 | 5.11 | DDPM pixel: 4.70 | Comparable quality, ~2.7× faster |
| LSUN-Bedrooms 256×256 | LDM-4 | 2.95 | DDPM: 4.90 | Better FID and faster |
| LSUN-Churches 256×256 | LDM-8 | 4.02 | DDIM: 5.22 | Better quality, ~4× cheaper |
| ImageNet 256×256 (class-cond.) | LDM-4 | 3.60 | ADM: 4.59 | Better FID without classifier guidance |
| MS-COCO 256×256 (text→image) | LDM | 12.63 | DALL-E: 17.89 | Better FID with less training compute |
| Places (inpainting) | LDM | 1.32 | CoModGAN: 3.61 | Large margin, SOTA at CVPR 2022 |
The speedup is real but nuanced. $f=8$ is 4–8× faster than pixel DDPM at similar FID. But $f=8$ compresses too aggressively for some tasks — fine texture detail lost in encoding cannot be recovered. $f=4$ is preferred for highest quality, giving ~2–3× speedup instead.
What doesn’t work:
The two-stage approach introduces a hard ceiling. Any detail the encoder discards is gone before diffusion starts. For tasks requiring precise pixel-level fidelity — medical imaging, satellite imagery, very high magnification super-resolution — the autoencoder bottleneck shows.
Sampling is still sequential. LDM made 1000-step diffusion tractable on a single GPU; it didn’t make it fast. A generation at inference still takes seconds even with DDIM (50 steps), which blocks real-time applications.
The choice of $f$ isn't free. Too-small $f$ wastes compute on pixel-level noise without helping quality. Too-large $f$ loses important structure. $f=4$ or $f=8$ is usually optimal, but this requires empirical validation per domain.
Training two stages adds complexity. The autoencoder must be fully trained and frozen before LDM training begins — unlike pixel diffusion, which is end-to-end.
So what?
If you’re building image or video generation systems today, you’re almost certainly using an LDM-style architecture: first-stage VAE compression followed by diffusion in the latent space. Stable Diffusion, SDXL, Kandinsky, and most open-source generation pipelines use exactly this two-stage design. The choice of $f$ controls the quality-speed trade-off: $f=4$ for highest fidelity, $f=8$ when throughput matters more than sharpest textures.
The cross-attention conditioning layer is the integration point for any guiding signal: swap in a CLIP or T5 encoder for text guidance; add a ControlNet stream for pose or depth conditioning; inject image CLIP embeddings for image-to-image generation. The architecture is modular because cross-attention is modular. That’s why the Stable Diffusion ecosystem could develop so many extensions so quickly — there’s one clean interface for all conditioning signals.
The deeper connection runs back to attention-is-all-you-need: the same cross-attention mechanism that made transformers universal sequence processors now makes diffusion models universal conditional generators. And dit-scalable-diffusion-models-with-transformers extends the logic further — replacing LDM’s U-Net backbone with a Vision Transformer, showing that the latent space design is architecture-agnostic and follows transformer scaling laws.
LDM’s contribution wasn’t a new loss function or a new architecture. It was recognizing that pixel-level generation and semantic structure generation are separable problems — and that separating them unlocks everything.
The architecture behind Stable Diffusion: run 1000 denoising steps in a 48× smaller space, decode once.
Connections
- diffusion-models — LDM applies the DDPM objective in compressed latent space
- latent-space — the perceptual autoencoder’s latent is where all diffusion computation happens
- vae — Stage 1 encoder/decoder is a VQ-VAE or KL-regularized autoencoder
- cross-attention — conditioning mechanism enabling any input modality without architecture changes
- attention-is-all-you-need — cross-attention conditioning mechanism from the Transformer
- dit-scalable-diffusion-models-with-transformers — replaces LDM’s U-Net backbone with a Vision Transformer
Citation
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. https://arxiv.org/abs/2112.10752