[Summary] DINOv3: Self-Supervised Vision Transformers at Scale
TLDR The DINO series advances self-supervised learning for vision transformers through iterative architectural and data refinements. DINOv1 introduces student-teacher distillation on ImageNet-1k. DINOv2 scales to 142M curated images with patch-level objectives. DINOv3 reaches 1.7B Instagram images with register tokens, new Gram matrix based loss, with a custom 7B-parameter ViT, achieving state-of-the-art performance on dense prediction tasks (like instance segmentation) while maintaining a frozen backbone. Motivation Supervised pretraining on ImageNet has dominated vision models, but manually annotating large datasets is expensive and constrains representation quality to label granularity....