What It Is
Video generation models produce sequences of temporally consistent image frames from a text prompt or image input. Modern approaches are almost exclusively diffusion models with a Diffusion Transformer (DiT) backbone, operating in a compressed latent space rather than on raw pixels.
Why It Matters
Text-to-video generation is the natural extension of text-to-image. It requires learning not just spatial structure but temporal dynamics: objects must move plausibly, lighting must stay consistent, and actions must follow physics. These constraints are harder to encode than static composition.
How It Works
Most modern T2V models use a Diffusion Transformer (DiT) backbone operating in a compressed video latent space. A VAE encodes frames into latents; the diffusion process denoises in that latent space; cross-attention between text tokens and spatial/temporal latents conditions the generation on the prompt.
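The payoff of the latent space is compression: the diffusion model denoises far fewer elements than raw pixels. A minimal shape sketch, using average pooling as a stand-in for a learned VAE encoder and hypothetical compression factors (8x spatial, 4x temporal, 16 latent channels — real models vary):

```python
import numpy as np

# Toy input clip: T frames of H x W RGB pixels.
T, H, W = 16, 256, 256
video = np.random.rand(T, H, W, 3).astype(np.float32)

# "VAE encode": average pooling as a stand-in for a learned encoder.
st, ss, C = 4, 8, 16   # hypothetical temporal stride, spatial stride, latent channels
latents = video.reshape(T // st, st, H // ss, ss, W // ss, ss, 3).mean(axis=(1, 3, 5))
latents = np.repeat(latents, C // 3 + 1, axis=-1)[..., :C]  # fake channel projection

print(latents.shape)                 # (4, 32, 32, 16)
print(video.size // latents.size)    # 48 — the denoiser sees ~48x fewer elements
```

The diffusion process, cross-attention to text tokens, and the final VAE decode all happen at this reduced resolution; only the decoder maps back to pixels.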
The key training objective is to predict the noise added at each diffusion step, conditioned on the text. Temporal consistency comes from 3D attention (or temporal attention layers) that allow each frame to attend to other frames.
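The two ideas above — noise prediction and frame-to-frame attention — can be sketched in a few lines. This is a simplified DDPM-style illustration with a dummy denoiser and single-head attention, not any particular model's implementation; `alpha_bar` stands in for the cumulative noise schedule at one step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent clip: T frames, N spatial tokens per frame, C channels.
T, N, C = 4, 64, 16
x0 = rng.standard_normal((T, N, C))

# --- Noise-prediction objective (one diffusion step) ---
alpha_bar = 0.7                       # cumulative schedule value at this step
eps = rng.standard_normal(x0.shape)   # the noise the model must recover
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

def denoiser(x):                      # stand-in; a real DiT (text-conditioned) goes here
    return np.zeros_like(x)

loss = np.mean((denoiser(x_t) - eps) ** 2)   # MSE between predicted and true noise

# --- Temporal attention: each spatial token attends across frames ---
q = k = v = x_t.transpose(1, 0, 2)                 # (N, T, C): time becomes the attention axis
scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)     # (N, T, T) frame-to-frame similarities
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = (weights @ v).transpose(1, 0, 2)             # back to (T, N, C)
```

Full 3D attention instead flattens all T*N tokens into one sequence, letting every token attend to every frame and position at once, at quadratic cost in T*N.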
Key Sources
- numina-counting-text-to-video — identifies counting failures as a systematic property of DiT T2V cross-attention
Related Concepts
Open Questions
- Long-form video coherence beyond ~10s clips
- Physical realism (rigid body, fluid dynamics)
- Counting and compositional accuracy — models learn co-occurrence, not structural constraints