Summary
NUMINA (CVPR 2026) is a training-free framework that fixes counting errors in text-to-video diffusion models. The core finding: numeral tokens (“three”, “five”) produce systematically diffuse, weak cross-attention patterns compared to nouns and adjectives — the model never truly learned to count. NUMINA runs a short walk-through generation, selects the most instance-discriminative self-attention head and the most semantically-concentrated cross-attention head per noun, fuses them into an explicit countable layout, corrects the count via minimal spatial edits, then regenerates with cross-attention modulation that guides the model toward the corrected layout.
Key Claims
- Numeral tokens have structurally weaker cross-attention grounding than other token types — a systematic property of DiT-based T2V models, not a training deficiency that scale alone will fix
- Training-free: no new parameters, no fine-tuning required — works as a plug-in on top of any DiT-based video generation model
- Improves counting accuracy by up to 7.4% on Wan2.1-1.3B, 4.9% on the 5B variant, and 5.5% on the 14B variant (evaluated on the paper's introduced CountBench benchmark: 210 prompts, counts 1–8, 1–3 object categories)
- CLIP score improves alongside counting accuracy — correct object counts lead to cleaner, more coherent scene compositions
- Integrates with EasyCache inference acceleration, reducing overhead of the walk-through pass
Methods
Identify phase: Run ~20 denoising steps of a walk-through generation. Score every self-attention head by a composite separability metric S = S1 + S2 + γ·S3, where S1 measures foreground contrast, S2 structural richness, and S3 edge clarity (via Sobel filtering). Score cross-attention heads per noun token by how concentrated their peak activation is. Select the best head of each type, then fuse them via cross-attention-masked clustering to extract a countable instance layout.
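The head-selection scoring can be sketched as follows. This is a minimal illustration, not the paper's implementation: the concrete definitions of S1 (foreground mean minus background mean), S2 (normalized spatial entropy), S3 (mean Sobel gradient magnitude), and the peak-to-mean concentration score are all assumptions standing in for the paper's terms.

```python
import numpy as np

def sobel_edge_energy(m):
    # 3x3 Sobel filters, valid-mode convolution done manually (numpy only).
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    ky = kx.T
    h, w = m.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros_like(gx)
    for i in range(3):
        for j in range(3):
            patch = m[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return float(np.mean(np.hypot(gx, gy)))

def score_self_attention_head(attn_map, fg_mask, gamma=0.5):
    """Composite separability score S = S1 + S2 + gamma * S3 for one head.

    attn_map: (H, W) spatial self-attention map; fg_mask: boolean foreground
    mask. S1/S2/S3 here are illustrative proxies for the paper's terms.
    """
    a = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    s1 = float(a[fg_mask].mean() - a[~fg_mask].mean())        # foreground contrast
    p = a.ravel() / (a.sum() + 1e-8)
    s2 = float(-(p * np.log(p + 1e-12)).sum() / np.log(p.size))  # structural richness
    s3 = sobel_edge_energy(a)                                  # edge clarity (Sobel)
    return s1 + s2 + gamma * s3

def concentration(cross_attn):
    """Cross-attention score for a noun token: peak-to-mean concentration."""
    return float(cross_attn.max() / (cross_attn.mean() + 1e-8))
```

Per noun, the selected pair would then simply be the argmax of each score across heads.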
Guide phase: If the extracted count mismatches the prompt, correct the layout conservatively: remove the smallest instance (minimum spatial disruption), or add an instance by copying the smallest existing one as a template at a cost-minimizing placement (penalizing overlap, distance from the group center, and frame-to-frame displacement). During the regeneration pass, modulate cross-attention with a time-decaying intensity δ(t): strong early (when the layout is being set), weak late (when details are being filled in).
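A sketch of the two guide-phase pieces: the cost-minimizing placement search for an added instance, and a time-decaying modulation schedule. The weights, grid search, and quadratic decay are assumptions for illustration; the paper specifies neither the exact cost weights nor the functional form of δ(t).

```python
import numpy as np

def placement_cost(pos, instances, prev_pos, radius,
                   w_overlap=10.0, w_center=1.0, w_motion=1.0):
    """Cost of placing a new instance at `pos` (weights are illustrative)."""
    pos = np.asarray(pos, dtype=float)
    centers = np.asarray(instances, dtype=float)
    d = np.linalg.norm(centers - pos, axis=1)
    overlap = np.clip(2 * radius - d, 0, None).sum()        # penalize overlap
    center_dist = np.linalg.norm(pos - centers.mean(axis=0))  # stay near the group
    motion = np.linalg.norm(pos - prev_pos) if prev_pos is not None else 0.0
    return w_overlap * overlap + w_center * center_dist + w_motion * motion

def best_placement(instances, prev_pos, radius, grid=32, extent=1.0):
    """Grid-search the frame for the cost-minimizing placement."""
    xs = np.linspace(0, extent, grid)
    candidates = [(x, y) for x in xs for y in xs]
    return min(candidates, key=lambda p: placement_cost(p, instances, prev_pos, radius))

def delta(t, T, delta0=1.0, power=2.0):
    """Time-decaying modulation intensity: strong early, weak late (sketch)."""
    return delta0 * (1.0 - t / T) ** power
```

The overlap term dominates, so the search settles just outside the exclusion radius of existing instances while the center term keeps the new instance near the group.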
CountBench: 210 prompts with 5 generated videos each, covering single- and multi-category object counting from 1 to 8 objects.
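The headline metric can be computed as exact-match accuracy over all generated videos. How detected counts are obtained and aggregated is an assumption here; this just fixes the arithmetic of the protocol.

```python
def counting_accuracy(results):
    """Exact-match counting accuracy over (detected_count, target_count) pairs.

    Under the CountBench protocol, `results` would hold 210 prompts x 5 videos
    = 1050 entries; the detector and aggregation are assumed, not specified here.
    """
    return sum(int(d == t) for d, t in results) / len(results)
```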
Failure modes
- High counts (6–8): compressed latent space makes instance separation unreliable
- Occluded/overlapping objects: clustering merges nearby instances, causing overcorrection
- Zero-instance correction: when no existing instance is available to copy, a generic circle template must be used, producing unnatural results
Connections
- clip-learning-transferable-visual-models — the vision-language alignment intuition; CLIP-style pretraining teaches co-occurrence, not structural count constraints
- attention-is-all-you-need — the attention mechanism that NUMINA reads and modulates
- video-generation — the task being improved: generating temporally coherent video from text prompts
- diffusion-models — NUMINA operates as a plug-in on top of DiT-based text-to-video diffusion models
- attention — cross-attention and self-attention patterns are analyzed and modulated to fix counting
Citation
@inproceedings{sun2026numina,
title={When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models},
author={Sun, Zhengyang and Chen, Yu and Zhou, Xin and Li, Xiaofan and Chen, Xiwu and Liang, Dingkang and Bai, Xiang},
booktitle={CVPR},
year={2026}
}