Stub — full ingest pending.

Oquab et al. (2023) scale and systematize DINO with three additions: a curated 142M-image dataset (LVD-142M), automatically filtered for quality and diversity; a training objective that combines DINO's image-level self-distillation with iBOT's masked image modeling term; and training at larger model and data scale. The resulting DINOv2 features transfer to depth estimation, semantic segmentation, and image classification with the backbone kept frozen (no fine-tuning, only a lightweight head trained on top), achieving state-of-the-art or competitive results across all three. DINOv2 has since become a standard visual backbone for multimodal and robotics systems.
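The frozen-backbone evaluation protocol described above can be sketched as follows. A tiny stand-in network replaces the actual DINOv2 ViT (which would typically be obtained via `torch.hub` from the `facebookresearch/dinov2` repository); the architecture, shapes, and hyperparameters here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# Placeholder backbone standing in for a pretrained DINOv2 ViT
# (illustrative only; real usage would load pretrained weights).
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 16 * 16, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
)

# Freeze the backbone: eval mode, no gradients, no fine-tuning.
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# Only this lightweight head is trained for the downstream task.
head = nn.Linear(32, 10)
opt = torch.optim.SGD(head.parameters(), lr=0.1)

images = torch.randn(8, 3, 16, 16)        # dummy batch
labels = torch.randint(0, 10, (8,))

with torch.no_grad():                      # features from the frozen backbone
    feats = backbone(images)               # shape: (8, 32)

loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()                            # gradients flow into the head only
opt.step()
```

The point of the protocol is that all task adaptation happens in the head; the backbone's representations are evaluated as-is, which is what makes the cross-task comparison meaningful.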

Key claim: Scaling DINO with curated data and a combined self-distillation and masked image modeling objective produces a universal frozen visual backbone that outperforms prior self-supervised models across diverse tasks.