"Emerging Properties in Self-Supervised Vision Transformers." Self-distillation with no labels: a student network learns from a momentum-updated teacher. ViT-Base achieves 80.1% ImageNet top-1 without any labeled data.

DINO showed that self-supervised ViT features contain explicit information about the semantic segmentation of an image, visible directly in the self-attention maps. The approach was extended in DINOv2 (2023), which trains all-purpose visual features on 142M curated images. By Caron, Touvron, Misra et al.
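
A hedged sketch of inspecting those attention maps with the released checkpoints, assuming the torch.hub entry point and the `get_last_selfattention` helper from the facebookresearch/dino codebase; the random tensor is a stand-in for a normalized 224x224 image:

```python
import torch

# Load the released DINO ViT-S/16 backbone from torch.hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a normalized input image
with torch.no_grad():
    # Shape: (batch, heads, tokens, tokens); tokens = 1 [CLS] + 14*14 patches.
    attn = model.get_last_selfattention(img)

# Attention from the [CLS] token to each patch token, one 14x14 map per head.
cls_attn = attn[0, :, 0, 1:].reshape(attn.shape[1], 14, 14)
```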

Paper

arXiv: 2104.14294

Venue: ICCV 2021

Tags: vision, foundational, open-source