What It Is
Video generation models produce sequences of temporally consistent image frames from a text prompt or image input. Modern approaches are almost exclusively diffusion models with a Diffusion Transformer (DiT) backbone, operating in a compressed latent space rather than on raw pixels.
Why It Matters
Text-to-video generation is the natural extension of text-to-image. It requires learning not just spatial structure but temporal dynamics: objects must move plausibly, lighting must stay consistent, and actions must follow physics. These constraints are harder to encode than static composition.
How It Works
Most modern T2V models use a Diffusion Transformer (DiT) backbone operating in a compressed video latent space. A VAE encodes frames into latents; the diffusion process denoises in that latent space; cross-attention between text tokens and spatial/temporal latents conditions the generation on the prompt.
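The payoff of the latent space is compression: the diffusion model denoises far fewer elements than raw pixels. A minimal shape sketch, using average pooling as a stand-in for a learned VAE encoder and hypothetical compression factors (8x spatial, 4x temporal, 16 latent channels — real models vary):

```python
import numpy as np

# Toy input clip: T frames of H x W RGB pixels.
T, H, W = 16, 256, 256
video = np.random.rand(T, H, W, 3).astype(np.float32)

# "VAE encode": average pooling as a stand-in for a learned encoder.
st, ss, C = 4, 8, 16   # hypothetical temporal stride, spatial stride, latent channels
latents = video.reshape(T // st, st, H // ss, ss, W // ss, ss, 3).mean(axis=(1, 3, 5))
latents = np.repeat(latents, C // 3 + 1, axis=-1)[..., :C]  # fake channel projection

print(latents.shape)                 # (4, 32, 32, 16)
print(video.size // latents.size)    # 48 — the denoiser sees ~48x fewer elements
```

The diffusion process, cross-attention to text tokens, and the final VAE decode all happen at this reduced resolution; only the decoder maps back to pixels.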
The key training objective is to predict the noise added at each diffusion step, conditioned on the text. Temporal consistency comes from 3D attention (or temporal attention layers) that allow each frame to attend to other frames.
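The two ideas above — noise prediction and frame-to-frame attention — can be sketched in a few lines. This is a simplified DDPM-style illustration with a dummy denoiser and single-head attention, not any particular model's implementation; `alpha_bar` stands in for the cumulative noise schedule at one step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent clip: T frames, N spatial tokens per frame, C channels.
T, N, C = 4, 64, 16
x0 = rng.standard_normal((T, N, C))

# --- Noise-prediction objective (one diffusion step) ---
alpha_bar = 0.7                       # cumulative schedule value at this step
eps = rng.standard_normal(x0.shape)   # the noise the model must recover
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

def denoiser(x):                      # stand-in; a real DiT (text-conditioned) goes here
    return np.zeros_like(x)

loss = np.mean((denoiser(x_t) - eps) ** 2)   # MSE between predicted and true noise

# --- Temporal attention: each spatial token attends across frames ---
q = k = v = x_t.transpose(1, 0, 2)                 # (N, T, C): time becomes the attention axis
scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)     # (N, T, T) frame-to-frame similarities
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = (weights @ v).transpose(1, 0, 2)             # back to (T, N, C)
```

Full 3D attention instead flattens all T*N tokens into one sequence, letting every token attend to every frame and position at once, at quadratic cost in T*N.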
Key Sources
- numina-counting-text-to-video — identifies counting failures as a systematic property of DiT T2V cross-attention
Related Concepts
Open Questions
- Long-form video coherence beyond ~10s clips
- Physical realism (rigid body, fluid dynamics)
- Counting and compositional accuracy — models learn co-occurrence, not structural constraints