What's actually different between CLIP and DINOv2? CLIP knows what "Brazil" looks like: Rio's skyline, sidewalk patterns, and soccer jerseys.
We mapped 24,576 visual features in vision models using sparse autoencoders, revealing surprising differences in what they understand.
What's actually different between CLIP and DINOv2? CLIP knows what "Brazil" looks like: Rio's skyline, sidewalk patterns, and soccer jerseys.
We mapped 24,576 visual features in vision models using sparse autoencoders, revealing surprising differences in what they understand.