What It Is

Video generation models produce sequences of temporally consistent image frames from a text prompt or image input. Modern approaches are almost exclusively diffusion models, typically built on a Diffusion Transformer (DiT) backbone, operating in a compressed latent space.

Why It Matters

Text-to-video generation is the natural extension of text-to-image. It requires learning not just spatial structure but temporal dynamics: objects must move plausibly, lighting must stay consistent, and actions must follow physics. These constraints are harder to encode than static composition.

How It Works

Most modern T2V models use a Diffusion Transformer (DiT) backbone operating in a compressed video latent space. A VAE encodes frames into latents; the diffusion process denoises in that latent space; cross-attention between text tokens and spatial/temporal latents conditions the generation on the prompt.
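The conditioning step above can be sketched at the shape level. The sketch below is a minimal single-head cross-attention in NumPy, with hypothetical sizes (8 frames, a 16×16 latent grid, 64 channels, 20 text tokens); real models use many heads, learned projections per layer, and far larger dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only:
T, H, W, C = 8, 16, 16, 64   # frames, latent height/width, channels
n_text = 20                  # encoded prompt tokens

latents = rng.standard_normal((T * H * W, C))  # flattened spatio-temporal tokens
text = rng.standard_normal((n_text, C))        # text-encoder output

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: queries come from the video latents, keys/values from
# the text tokens, so every latent position can read prompt information.
Wq = rng.standard_normal((C, C)) / np.sqrt(C)
Wk = rng.standard_normal((C, C)) / np.sqrt(C)
Wv = rng.standard_normal((C, C)) / np.sqrt(C)

Q = latents @ Wq                         # (T*H*W, C)
K = text @ Wk                            # (n_text, C)
V = text @ Wv                            # (n_text, C)

attn = softmax(Q @ K.T / np.sqrt(C))     # (T*H*W, n_text) weights over text
conditioned = attn @ V                   # latents updated with prompt info
```

Each row of `attn` is a distribution over the prompt tokens, which is what lets a single latent position pick out "red ball" versus "blue sky" from the text.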

The key training objective is to predict the noise added at each diffusion step, conditioned on the text. Temporal consistency comes from 3D attention (or temporal attention layers) that allow each frame to attend to other frames.
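The objective and the temporal-attention reshape can both be shown concretely. Below is a sketch of a DDPM-style forward process and noise-prediction loss on a toy latent "video", with the denoiser faked by a perturbed copy of the true noise (all sizes and the noise-schedule value are illustrative assumptions, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy clean video latent: 4 frames of 8x8 grids with 4 channels (hypothetical).
T, H, W, C = 4, 8, 8, 4
x0 = rng.standard_normal((T, H, W, C))

# Forward diffusion at some step t: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps.
alpha_bar = 0.5  # cumulative schedule value at step t (illustrative)
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# The real network maps (x_t, t, text) -> eps_pred; we fake a prediction
# here just to show the loss being computed.
eps_pred = eps + 0.1 * rng.standard_normal(x0.shape)
loss = np.mean((eps_pred - eps) ** 2)  # MSE noise-prediction objective

# Factorized temporal attention is often just a reshape: group each spatial
# position's trajectory across frames so attention runs over the T axis.
temporal_tokens = x_t.reshape(T, H * W, C).transpose(1, 0, 2)  # (H*W, T, C)
```

After the transpose, an ordinary attention layer applied over axis 1 lets each frame attend to every other frame at the same spatial position, which is one common way models enforce temporal consistency without the cost of full 3D attention.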

Key Sources

Open Questions

  • Long-form video coherence beyond ~10s clips
  • Physical realism (rigid body, fluid dynamics)
  • Counting and compositional accuracy: models learn co-occurrence, not structural constraints