What It Is

Falcon Perception is a 0.6B-parameter early-fusion Transformer from the Technology Innovation Institute (TII, UAE) that handles open-vocabulary grounding and segmentation from natural-language prompts, processing image patches and text in a single unified sequence under a hybrid attention mask. A companion 0.3B model, Falcon OCR, applies the same architecture to document understanding and achieves state-of-the-art throughput among open-source OCR models.
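The hybrid attention mask described above (bidirectional within the visual prefix, causal over text with full access to all image tokens) can be sketched as a boolean mask; the layout below is an assumption based on the summary, not the paper's exact implementation:

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean mask for [image prefix | text/task tokens].

    True means the query row may attend to the key column.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image tokens attend bidirectionally to all image tokens.
    mask[:n_image, :n_image] = True
    # Text/task tokens attend to the full visual prefix ...
    mask[n_image:, :n_image] = True
    # ... and causally to themselves and preceding text tokens.
    mask[n_image:, n_image:] = np.tril(np.ones((n_text, n_text), dtype=bool))
    return mask

m = hybrid_attention_mask(n_image=4, n_text=3)
```

With this layout an image token sees every other image token but no text, while each text token sees the whole image plus its causal text history.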

Key Contributions

  • Early-fusion single backbone: image patches and text tokens share parameter space from layer one; no separate vision encoder or late-fusion decoder
  • Hybrid attention mask: image tokens attend bidirectionally to all image tokens; text/task tokens attend causally to the full visual prefix plus preceding text
  • Chain-of-Perception interface: structured <coord> → <size> → <seg> token sequence for variable-length instance prediction without fixed-query decoder constraints
  • Fourier feature heads: continuous coordinate/size prediction using random Gaussian projections into sinusoidal space, re-injected as conditioning signals
  • Multi-teacher distillation initialization: DINOv3 (ViT-H) for local features + SigLIP2 for language-aligned features; 74.25% zero-shot ImageNet-1k accuracy before perception training
  • PBench: new diagnostic benchmark separating grounding capability by type (L0–L4: simple objects, attributes, OCR-guided, spatial, relational) and a Dense stress-test split
  • SA-Co results: 68.0 Macro-F1 vs 62.3 for SAM 3; largest gains on attribute-heavy, food/drink, and sports equipment categories
  • Falcon OCR: 80.3% on olmOCR, 88.64 on OmniDocBench at 0.3B parameters — highest throughput of any open-source OCR model; ~3x smaller than 0.9B-class competitors
  • Training recipe: three-stage (in-context listing → task alignment → long-context finetuning), 700B tokens total, 54M images, 195M positive expressions, 488M hard negatives
  • Key ablations: Muon optimizer for heads (+4.8 pts), raster ordering (+10 pts over random), Gram feature regularization (+1.5 pts), global loss normalization for FSDP packing
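The Fourier feature heads listed above project continuous coordinates through a random Gaussian matrix into sinusoidal space. A minimal sketch of that mapping (in the style of standard random Fourier features; the scale, frequency count, and seed here are illustrative assumptions, not the paper's values):

```python
import numpy as np

def make_fourier_head(d_in: int, n_freqs: int, scale: float = 10.0, seed: int = 0):
    """Return an embedding function using a fixed random Gaussian projection.

    Maps coordinates in [0, 1]^d_in to a 2 * n_freqs sinusoidal embedding.
    """
    # The projection matrix is sampled once and then frozen.
    B = np.random.default_rng(seed).normal(0.0, scale, size=(d_in, n_freqs))

    def embed(coords):
        proj = 2.0 * np.pi * np.asarray(coords) @ B
        return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

    return embed

embed = make_fourier_head(d_in=2, n_freqs=8)
z = embed(np.array([[0.5, 0.25]]))  # one (x, y) coordinate -> 16-dim embedding
```

Per the summary, such embeddings of predicted coordinates/sizes are re-injected as conditioning signals for subsequent predictions.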
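The raster-ordering ablation (+10 pts over random) presumably fixes a canonical prediction order for instances. A minimal sketch, assuming "raster" means top-to-bottom then left-to-right by box origin (the exact sort key is a hypothetical choice, not confirmed by the source):

```python
def raster_order(boxes):
    """Sort (x, y, w, h) instance boxes in raster order:
    top-to-bottom first, then left-to-right within a row."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

ordered = raster_order([(3, 5, 1, 1), (1, 0, 1, 1), (4, 0, 1, 1)])
```

A fixed ordering like this removes the permutation ambiguity a sequence model would otherwise have to resolve during variable-length instance prediction.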

Key Concepts

Key Entities

Open Questions

  • The presence-calibration gap remains large (MCC 0.64 vs 0.82 for SAM 3); it is unclear whether this stems from the training signal, data balance, or the architecture
  • Whether the hybrid attention pattern generalizes to video or 3D inputs beyond static images
  • Whether Chain-of-Perception's raster-ordering assumption breaks down in highly cluttered or non-canonical scenes
  • Whether Falcon OCR's from-scratch training (no distillation) generalizes to multilingual OCR
  • How the early-fusion single-stack recipe scales beyond 0.6B parameters