Summary

NUMINA (CVPR 2026) is a training-free framework that fixes counting errors in text-to-video diffusion models. The core finding: numeral tokens (“three”, “five”) produce systematically diffuse, weak cross-attention patterns compared to nouns and adjectives — the model never truly learned to count. NUMINA runs a short walk-through generation, selects the most instance-discriminative self-attention head and the most semantically-concentrated cross-attention head per noun, fuses them into an explicit countable layout, corrects the count via minimal spatial edits, then regenerates with cross-attention modulation that guides the model toward the corrected layout.

Key Claims

  • Numeral tokens have structurally weaker cross-attention grounding than other token types — a systematic property of DiT-based T2V models, not a training deficiency that scale alone will fix
  • Training-free: no new parameters, no fine-tuning required — works as a plug-in on top of any DiT-based video generation model
  • Improves counting accuracy by 7.4% on Wan2.1-1.3B, 4.9% on Wan2.1-5B, and 5.5% on Wan2.1-14B (evaluated on the introduced CountBench benchmark: 210 prompts, counts 1–8, 1–3 object categories)
  • CLIP score improves alongside counting accuracy — correct object counts lead to cleaner, more coherent scene compositions
  • Integrates with EasyCache inference acceleration, reducing overhead of the walk-through pass

Methods

Identify phase: Run ~20 denoising steps. Score all self-attention heads by a composite separability metric S = S₁ + S₂ + γ·S₃ (foreground contrast + structural richness + edge clarity via Sobel). Score cross-attention heads per noun token by peak activation concentration. Select the best head of each type, then fuse them via cross-attention-masked clustering to extract a countable instance layout.
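The head-selection step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention maps are assumed to arrive as per-head 2D arrays, the foreground mask is a crude mean threshold, and `gamma` and the exact definitions of S₁–S₃ are stand-ins for the paper's metric.

```python
import numpy as np

def sobel_edge_energy(a):
    """Mean Sobel gradient magnitude -- a proxy for edge clarity (S3)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = a.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # 'valid' 3x3 convolution via shifted slices (no SciPy dependency)
    for i in range(3):
        for j in range(3):
            patch = a[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return float(np.mean(np.hypot(gx, gy)))

def separability(attn, gamma=0.5):
    """Composite score S = S1 + S2 + gamma*S3 for one self-attention map."""
    fg = attn > attn.mean()                           # crude foreground mask
    s1 = float(attn[fg].mean() - attn[~fg].mean())    # S1: foreground contrast
    s2 = float(attn.std())                            # S2: structural richness
    s3 = sobel_edge_energy(attn)                      # S3: edge clarity
    return s1 + s2 + gamma * s3

def select_best_head(head_maps, gamma=0.5):
    """Pick the most instance-discriminative self-attention head."""
    scores = [separability(m, gamma) for m in head_maps]
    return int(np.argmax(scores)), scores
```

A head whose map has crisp, high-contrast instance blobs scores higher on all three terms than one with diffuse, near-uniform attention, which is exactly the property the fusion step needs.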

Guide phase: If the detected count mismatches the prompt, correct the layout conservatively: remove the smallest instance (minimum spatial disruption), or add an instance by copying the smallest existing one as a template at a cost-minimizing placement (penalizing overlap, distance from the group center, and frame-to-frame displacement). Modulate cross-attention in the regeneration pass with time-decaying intensity δ(t) — strong early (layout setting), weak late (detail filling).
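The correction logic above can be sketched as a small search. This is an illustrative simplification under assumed conventions: instances are `(x, y, size)` tuples in normalized coordinates, the cost uses only overlap and distance-to-center terms (omitting the frame-to-frame displacement penalty, which needs multiple frames), the `4.0` overlap weight and linear δ(t) decay are assumptions, and at least one instance must already exist — the zero-instance case needs a generic template, as noted under failure modes.

```python
import math

def correct_count(instances, target):
    """Minimally edit a list of (x, y, size) instances toward `target` count."""
    insts = sorted(instances, key=lambda p: p[2])   # smallest first
    while len(insts) > target:                      # too many: drop smallest
        insts.pop(0)
    while len(insts) < target:                      # too few: clone smallest
        x0, y0, s0 = insts[0]
        cx = sum(p[0] for p in insts) / len(insts)  # group center
        cy = sum(p[1] for p in insts) / len(insts)
        best, best_cost = None, math.inf
        for i in range(21):                         # coarse grid search
            for j in range(21):
                x, y = i / 20.0, j / 20.0
                # overlap penalty: how deeply the clone intrudes on others
                overlap = sum(max(0.0, s0 - math.hypot(x - px, y - py))
                              for px, py, _ in insts)
                cost = 4.0 * overlap + math.hypot(x - cx, y - cy)
                if cost < best_cost:
                    best, best_cost = (x, y, s0), cost
        insts.append(best)
    return insts

def delta(t, T, d0=1.0):
    """Time-decaying modulation: strong early (layout), weak late (detail)."""
    return d0 * (1.0 - t / T)
```

Removal always targets the smallest instance, and insertion prefers spots near the group center that do not collide with existing objects — both choices keep the spatial edit minimal, matching the "conservative correction" principle.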

CountBench: 210 prompts, 5 generated videos each, covering single-category and multi-category object counting from 1 to 8 objects.
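Counting accuracy on such a benchmark reduces to exact-match over (target, detected) pairs. A minimal scoring sketch, assuming one detected count per generated video (the detector itself is out of scope here):

```python
from collections import defaultdict

def counting_accuracy(results):
    """results: iterable of (target_count, detected_count) pairs,
    one per generated video (e.g. 210 prompts x 5 videos each).
    Returns overall accuracy and a per-target-count breakdown."""
    per_count = defaultdict(lambda: [0, 0])   # target -> [correct, total]
    for target, detected in results:
        per_count[target][1] += 1
        if detected == target:
            per_count[target][0] += 1
    correct = sum(c for c, _ in per_count.values())
    total = sum(t for _, t in per_count.values())
    breakdown = {k: c / t for k, (c, t) in sorted(per_count.items())}
    return correct / total, breakdown
```

The per-count breakdown is what surfaces the high-count failure mode below: accuracy typically degrades as the target count approaches 8.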

Failure modes

  • High counts (6–8): compressed latent space makes instance separation unreliable
  • Occluded/overlapping objects: clustering merges nearby instances, causing overcorrection
  • Zero-instance correction: with no existing object to copy, must fall back to a generic circle template, producing unnatural results

Connections

  • clip-learning-transferable-visual-models — the vision-language alignment intuition; CLIP-style pretraining teaches co-occurrence, not structural count constraints
  • attention-is-all-you-need — the attention mechanism that NUMINA reads and modulates
  • video-generation — the task being improved: generating temporally coherent video from text prompts
  • diffusion-models — NUMINA operates as a plug-in on top of DiT-based text-to-video diffusion models
  • attention — cross-attention and self-attention patterns are analyzed and modulated to fix counting

Citation

arXiv:2604.08546

@inproceedings{sun2026numina,
  title={When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models},
  author={Sun, Zhengyang and Chen, Yu and Zhou, Xin and Li, Xiaofan and Chen, Xiwu and Liang, Dingkang and Bai, Xiang},
  booktitle={CVPR},
  year={2026}
}