What It Is

A generative model that learns to encode data into a continuous latent distribution and decode samples from that distribution back to data — training both encoder and decoder jointly with a reconstruction loss plus a KL divergence regularizer that keeps the latent space smooth and contiguous.

Why It Matters

VAEs are the compression backbone of Latent Diffusion Models. The autoencoder in Stable Diffusion (a KL-regularized VAE; the original LDM paper also evaluates a VQ-regularized variant) downsamples images by a factor of 4-8 per spatial dimension — Stable Diffusion uses f = 8, mapping a 512×512×3 image to a 64×64×4 latent. Diffusion then operates in this compressed space, cutting training and inference cost dramatically while preserving perceptual quality.
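As a rough sense of the savings, assuming the commonly published Stable Diffusion shapes (512×512 RGB input, 64×64×4 latent), the element-count reduction works out to:

```python
# Element counts for a typical Stable Diffusion autoencoder (f = 8).
# Shapes are the commonly published ones, assumed here for illustration.
pixels = 512 * 512 * 3   # input image: H x W x RGB channels
latents = 64 * 64 * 4    # latent: (H/8) x (W/8) x 4 channels
print(pixels // latents) # 48x fewer elements for diffusion to process
```

The spatial downsampling is 8× per dimension, but because the latent keeps only 4 channels, the total element count shrinks by 48×.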

How It Works

The encoder outputs a mean and variance rather than a single point: $q_\phi(z \mid x) = \mathcal{N}(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x)))$. The training objective is the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\ \|\ p(z)\big)$$

The reconstruction term pushes the decoder to faithfully reproduce inputs. The KL term regularizes the latent space toward the standard normal prior $p(z) = \mathcal{N}(0, I)$, ensuring smooth interpolation and valid samples from random $z \sim \mathcal{N}(0, I)$. Gradients flow through the sampling step via the reparameterization trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
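The two pieces the objective needs beyond a plain autoencoder — differentiable sampling and the closed-form KL to a standard normal — can be sketched in NumPy. Function names here are illustrative, not from any library:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): sampling becomes a
    # deterministic function of (mu, sigma), so gradients can flow through.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )
    # = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(4), np.zeros(4)          # sigma = 1
print(kl_to_standard_normal(mu, log_var))       # 0.0: q already matches the prior
print(reparameterize(mu, log_var, rng).shape)   # (4,)
```

In training, the KL term is added to the reconstruction loss (e.g. MSE between input and decoder output), so the encoder is penalized for drifting away from the prior.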

Key Sources