What It Is
Falcon Perception is a 0.6B-parameter early-fusion Transformer from TII UAE that handles open-vocabulary grounding and segmentation via natural language prompts, processing image patches and text in one unified sequence with a hybrid attention mask. A companion 0.3B model, Falcon OCR, applies the same architecture to document understanding and achieves state-of-the-art throughput for open-source OCR.
Key Contributions
- Early-fusion single backbone: image patches and text tokens share parameter space from layer one; no separate vision encoder or late-fusion decoder
- Hybrid attention mask: image tokens attend bidirectionally to all image tokens; text/task tokens attend causally to the full visual prefix plus preceding text
- Chain-of-Perception interface: structured <coord> → <size> → <seg> token sequence for variable-length instance prediction without fixed-query decoder constraints
- Fourier feature heads: continuous coordinate/size prediction using random Gaussian projections into sinusoidal space, re-injected as conditioning signals
- Multi-teacher distillation initialization: DINOv3 (ViT-H) for local features + SigLIP2 for language-aligned features; 74.25% zero-shot ImageNet-1k accuracy before perception training
- PBench: new diagnostic benchmark separating grounding capability by type (L0–L4: simple objects, attributes, OCR-guided, spatial, relational) and a Dense stress-test split
- SA-Co results: 68.0 Macro-F1 vs 62.3 for SAM 3; largest gains on attribute-heavy, food/drink, and sports equipment categories
- Falcon OCR: 80.3% on olmOCR, 88.64 on OmniDocBench at 0.3B parameters — highest throughput of any open-source OCR model; ~3x smaller than 0.9B-class competitors
- Training recipe: three-stage (in-context listing → task alignment → long-context finetuning), 700B tokens total, 54M images, 195M positive expressions, 488M hard negatives
- Key ablations: Muon optimizer for heads (+4.8 pts), raster ordering (+10 pts over random), Gram feature regularization (+1.5 pts), global loss normalization for FSDP packing
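The hybrid attention mask described above can be sketched as a boolean matrix: a minimal NumPy illustration, assuming the image tokens form a contiguous prefix of the packed sequence (the exact token layout is an assumption, not taken from the paper).

```python
import numpy as np

def hybrid_attention_mask(num_image: int, num_text: int) -> np.ndarray:
    """Boolean attention mask; True means the query row may attend
    to the key column.

    Image tokens (the first `num_image` positions) attend
    bidirectionally among themselves; text/task tokens attend to the
    full visual prefix plus preceding text, causally.
    """
    n = num_image + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:num_image, :num_image] = True
    # Each text row sees the whole visual prefix and a causal text window.
    for q in range(num_image, n):
        mask[q, :num_image] = True       # full visual prefix
        mask[q, num_image:q + 1] = True  # causal over text so far
    return mask
```

In practice this mask would be passed to the attention kernel as an additive bias (0 where True, -inf where False); the loop form is kept here for clarity.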
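The Chain-of-Perception interface and the raster-ordering ablation above suggest a serialization like the following. This is a hypothetical sketch: the field names (`cx`, `cy`, `w`, `h`, `mask_id`) and the textual token rendering are assumptions; only the per-instance <coord> → <size> → <seg> structure and the raster ordering come from the source.

```python
def serialize_instances(instances: list[dict]) -> list[str]:
    """Serialize detected instances into a Chain-of-Perception-style
    token sequence: per instance, <coord> then <size> then <seg>.

    Instances are emitted in raster order (top-to-bottom, then
    left-to-right by box center), matching the raster-ordering
    ablation; field names are hypothetical.
    """
    ordered = sorted(instances, key=lambda i: (i["cy"], i["cx"]))
    tokens: list[str] = []
    for inst in ordered:
        tokens += [
            f"<coord {inst['cx']:.3f} {inst['cy']:.3f}>",
            f"<size {inst['w']:.3f} {inst['h']:.3f}>",
            f"<seg {inst['mask_id']}>",
        ]
    return tokens
```

Because the sequence length is three tokens per instance, the decoder can predict any number of instances autoregressively, with no fixed-query cap.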
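The Fourier feature heads can be illustrated with the standard random-Fourier-feature embedding: project continuous coordinates through a fixed random Gaussian matrix, then take sin/cos. A minimal sketch; the bandwidth scale and feature dimension below are assumptions, not values from the paper.

```python
import numpy as np

def fourier_features(coords: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Map continuous coordinates in [0, 1] into sinusoidal space.

    coords: (..., d) array of normalized coordinates.
    B:      (d, m) random Gaussian projection, drawn once at init
            and kept fixed.
    Returns a (..., 2*m) feature vector, usable both as a regression
    target space and as a conditioning signal re-injected into the model.
    """
    proj = 2.0 * np.pi * coords @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Hypothetical setup: 2-D box centers, 64 random frequencies, scale 10.
rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(2, 64))
feats = fourier_features(np.array([[0.25, 0.75]]), B)  # shape (1, 128)
```

The scale of the Gaussian draw controls how fine-grained the coordinate resolution is: larger scales let the head distinguish nearby positions at the cost of smoothness.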
Key Concepts
- early-fusion
- visual-grounding
- open-vocabulary-segmentation
- distillation
- chain-of-thought
- attention
- transformer
- inference-efficiency
Key Entities
Open Questions
- Presence calibration gap remains large (MCC 0.64 vs 0.82 for SAM 3) — unclear whether this is a training signal, data balance, or architecture issue
- How well does the hybrid attention pattern generalize to video or 3D inputs beyond static images?
- Whether Chain-of-Perception’s raster ordering assumption breaks down for highly cluttered or non-canonical scenes
- Whether Falcon OCR’s from-scratch training (no distillation) approach generalizes to multilingual OCR
- Scaling behavior of the early-fusion single-stack recipe beyond 0.6B parameters