What It Is

A compressed, lower-dimensional representation of data learned by an encoder — the internal coordinate system a model uses to represent inputs. Points close together in latent space correspond to semantically similar inputs.

Why It Matters

Operating in latent space rather than raw pixel/token space is the key efficiency trick behind Stable Diffusion (Latent Diffusion Models). A 512×512 RGB image has 512×512×3 = 786,432 values; its latent representation might be 64×64×4 = 16,384 values — 48× fewer. Running diffusion in this compressed space slashes compute while preserving semantic content.
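The arithmetic behind that ratio can be checked directly. A minimal sketch, assuming Stable Diffusion v1's 8× spatial downsampling and 4 latent channels (the function name and parameters here are illustrative):

```python
# Compare raw pixel count vs. latent size for an RGB image, assuming
# an 8x-downsampling VAE with 4 latent channels (as in Stable Diffusion v1).
def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    pixel_values = height * width * channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values, latent_values, pixel_values / latent_values

pixels, latents, ratio = compression_ratio(512, 512)
print(pixels, latents, ratio)  # 786432 16384 48.0
```

The ratio grows with image size: at 1024×1024 the same VAE yields a 128×128×4 latent, still 48× smaller, so the savings scale with resolution.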

How It Works

An encoder (typically a VAE) maps a high-dimensional input x to a lower-dimensional latent z = E(x). A decoder reconstructs the original: x̂ = D(z). Generative models like diffusion models learn to operate in this latent space — adding and removing noise from z rather than from x directly. The decoder is only needed at inference time to convert latent samples back to pixels.
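The encode → diffuse-in-latent-space → decode flow can be sketched with toy linear maps standing in for a trained VAE (a real system learns E and D as neural networks; every name and dimension below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained VAE: a random linear "encoder" E and its
# pseudo-inverse as the "decoder" D. Real systems learn these as networks.
input_dim, latent_dim = 768, 16
E = rng.standard_normal((latent_dim, input_dim)) / np.sqrt(input_dim)
D = np.linalg.pinv(E)  # D approximately inverts E on the latent side

x = rng.standard_normal(input_dim)      # high-dimensional input (e.g. pixels)
z = E @ x                               # encode: z = E(x)

# Diffusion-style forward step applied in latent space, not pixel space:
alpha = 0.9
z_noisy = np.sqrt(alpha) * z + np.sqrt(1 - alpha) * rng.standard_normal(latent_dim)

# A trained denoiser would recover z from z_noisy; here we just decode z.
x_hat = D @ z                           # decode only at the end: x_hat = D(z)
print(z.shape, x_hat.shape)             # (16,) (768,)
```

Note that with a random linear encoder, x̂ is only a projection of x; a trained VAE decoder produces a perceptually faithful reconstruction instead. The key structural point survives, though: the noising step touches only the 16-dimensional z, never the 768-dimensional x.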

Key Sources