What It Is
Falcon Perception is a 0.6B-parameter early-fusion Transformer from TII UAE that handles open-vocabulary grounding and segmentation via natural language prompts, processing image patches and text in one unified sequence with a hybrid attention mask. A companion 0.3B model, Falcon OCR, applies the same architecture to document understanding and achieves state-of-the-art throughput for open-source OCR.
Key Contributions
- Early-fusion single backbone: image patches and text tokens share parameter space from layer one; no separate vision encoder or late-fusion decoder
- Hybrid attention mask: image tokens attend bidirectionally to all image tokens; text/task tokens attend causally to the full visual prefix plus preceding text
- Chain-of-Perception interface: structured <coord> → <size> → <seg> token sequence for variable-length instance prediction without fixed-query decoder constraints
- Fourier feature heads: continuous coordinate/size prediction using random Gaussian projections into sinusoidal space, re-injected as conditioning signals
- Multi-teacher distillation initialization: DINOv3 (ViT-H) for local features + SigLIP2 for language-aligned features; 74.25% zero-shot ImageNet-1k accuracy before perception training
- PBench: new diagnostic benchmark separating grounding capability by type (L0–L4: simple objects, attributes, OCR-guided, spatial, relational) and a Dense stress-test split
- SA-Co results: 68.0 Macro-F1 vs 62.3 for SAM 3; largest gains on attribute-heavy, food/drink, and sports equipment categories
- Falcon OCR: 80.3% on olmOCR, 88.64 on OmniDocBench at 0.3B parameters — highest throughput of any open-source OCR model; ~3x smaller than 0.9B-class competitors
- Training recipe: three-stage (in-context listing → task alignment → long-context finetuning), 700B tokens total, 54M images, 195M positive expressions, 488M hard negatives
- Key ablations: Muon optimizer for heads (+4.8 pts), raster ordering (+10 pts over random), Gram feature regularization (+1.5 pts), global loss normalization for FSDP packing
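The hybrid attention mask described above can be sketched as a boolean matrix: a minimal NumPy illustration, assuming the image tokens form a contiguous prefix of the packed sequence (the exact token layout is an assumption, not taken from the paper).

```python
import numpy as np

def hybrid_attention_mask(num_image: int, num_text: int) -> np.ndarray:
    """Boolean attention mask; True means the query row may attend
    to the key column.

    Image tokens (the first `num_image` positions) attend
    bidirectionally among themselves; text/task tokens attend to the
    full visual prefix plus preceding text, causally.
    """
    n = num_image + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:num_image, :num_image] = True
    # Each text row sees the whole visual prefix and a causal text window.
    for q in range(num_image, n):
        mask[q, :num_image] = True       # full visual prefix
        mask[q, num_image:q + 1] = True  # causal over text so far
    return mask
```

In practice this mask would be passed to the attention kernel as an additive bias (0 where True, -inf where False); the loop form is kept here for clarity.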
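The Chain-of-Perception interface and the raster-ordering ablation above suggest a serialization like the following. This is a hypothetical sketch: the field names (`cx`, `cy`, `w`, `h`, `mask_id`) and the textual token rendering are assumptions; only the per-instance <coord> → <size> → <seg> structure and the raster ordering come from the source.

```python
def serialize_instances(instances: list[dict]) -> list[str]:
    """Serialize detected instances into a Chain-of-Perception-style
    token sequence: per instance, <coord> then <size> then <seg>.

    Instances are emitted in raster order (top-to-bottom, then
    left-to-right by box center), matching the raster-ordering
    ablation; field names are hypothetical.
    """
    ordered = sorted(instances, key=lambda i: (i["cy"], i["cx"]))
    tokens: list[str] = []
    for inst in ordered:
        tokens += [
            f"<coord {inst['cx']:.3f} {inst['cy']:.3f}>",
            f"<size {inst['w']:.3f} {inst['h']:.3f}>",
            f"<seg {inst['mask_id']}>",
        ]
    return tokens
```

Because the sequence length is three tokens per instance, the decoder can predict any number of instances autoregressively, with no fixed-query cap.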
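The Fourier feature heads can be illustrated with the standard random-Fourier-feature embedding: project continuous coordinates through a fixed random Gaussian matrix, then take sin/cos. A minimal sketch; the bandwidth scale and feature dimension below are assumptions, not values from the paper.

```python
import numpy as np

def fourier_features(coords: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Map continuous coordinates in [0, 1] into sinusoidal space.

    coords: (..., d) array of normalized coordinates.
    B:      (d, m) random Gaussian projection, drawn once at init
            and kept fixed.
    Returns a (..., 2*m) feature vector, usable both as a regression
    target space and as a conditioning signal re-injected into the model.
    """
    proj = 2.0 * np.pi * coords @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Hypothetical setup: 2-D box centers, 64 random frequencies, scale 10.
rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(2, 64))
feats = fourier_features(np.array([[0.25, 0.75]]), B)  # shape (1, 128)
```

The scale of the Gaussian draw controls how fine-grained the coordinate resolution is: larger scales let the head distinguish nearby positions at the cost of smoothness.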
Key Concepts
- early-fusion
- visual-grounding
- open-vocabulary-segmentation
- distillation
- chain-of-thought
- attention
- transformer
- inference-efficiency
Key Entities
Open Questions
- Presence calibration gap remains large (MCC 0.64 vs 0.82 for SAM 3) — unclear whether this is a training signal, data balance, or architecture issue
- How well does the hybrid attention pattern generalize to video or 3D inputs beyond static images?
- Whether Chain-of-Perception’s raster ordering assumption breaks down for highly cluttered or non-canonical scenes
- Whether Falcon OCR’s from-scratch training (no distillation) approach generalizes to multilingual OCR
- Scaling behavior of the early-fusion single-stack recipe beyond 0.6B parameters