What It Is
A compressed, lower-dimensional representation of data learned by an encoder — the internal coordinate system a model uses to represent inputs. Points close together in latent space correspond to semantically similar inputs.
Why It Matters
Operating in latent space rather than raw pixel/token space is the key efficiency trick behind Stable Diffusion (Latent Diffusion Models). A 512×512 RGB image has 512×512×3 = 786,432 values; its latent representation might be 64×64×4 = 16,384 values — 48× fewer. Running diffusion in this compressed space slashes compute while preserving semantic content.
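The arithmetic above can be checked directly. This sketch assumes Stable Diffusion's usual setup: the VAE downsamples each spatial side by 8 and uses 4 latent channels.

```python
# Compression factor from pixel space to latent space.
H, W, C = 512, 512, 3                      # input image: height, width, RGB channels
pixel_values = H * W * C                   # 786,432 raw values

lat_h, lat_w, lat_c = H // 8, W // 8, 4    # 64 x 64 x 4 latent grid
latent_values = lat_h * lat_w * lat_c      # 16,384 values

print(pixel_values, latent_values, pixel_values // latent_values)
# 786432 16384 48
```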
How It Works
An encoder (typically a VAE) maps a high-dimensional input x to a lower-dimensional latent z = E(x). A decoder reconstructs the original: x̂ = D(z). Generative models like diffusion models learn to operate in this latent space — adding and removing noise from z rather than from x directly. The decoder is only needed at inference time to convert latent samples back to pixels.
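The encode → operate-in-latent → decode flow can be sketched with a toy linear encoder/decoder. This is a minimal illustration of the shapes involved, not a real VAE (which is a deep nonlinear network trained with a reconstruction plus KL objective); E is a random projection and D its pseudo-inverse, and the dimensions are shrunk so the demo runs instantly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the VAE: encoder E is a random linear map,
# decoder D is its pseudo-inverse (an approximate inverse of E).
input_dim, latent_dim = 768, 16
E = rng.standard_normal((latent_dim, input_dim)) / np.sqrt(input_dim)
D = np.linalg.pinv(E)

x = rng.standard_normal(input_dim)   # "image" flattened to a vector
z = E @ x                            # encode: z = E(x)

# A diffusion-style step happens in latent space: noise is added to z, not x.
z_noisy = z + 0.1 * rng.standard_normal(latent_dim)

x_hat = D @ z_noisy                  # decode only at the end: x̂ = D(z)
print(x.shape, z.shape, x_hat.shape)
# (768,) (16,) (768,)
```

The point of the sketch: everything expensive (the noising/denoising loop) touches only the 16-dimensional z, and D is invoked once at the end to return to input space.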