- 19x faster convergence ⚡
- 370x fewer FLOPs than FLUX-dev 📉
It outperforms CLIP-like models (SigLIP 2, finetuned StreetCLIP)… and that’s shocking 🤯
Why? CLIP-style models have an innate advantage: they literally learn to pair place names with images. DINOv3 never sees any text.
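To make the contrast concrete, here is a minimal sketch (not the paper's code) of the two setups: CLIP can be queried zero-shot with place-name prompts because it was pretrained on image-text pairs, while a DINOv3-style vision-only backbone only produces features that a separately trained head must map to locations. The checkpoint names and image path below are illustrative assumptions.

```python
# Hedged sketch, not the paper's pipeline: why CLIP-like models start with an
# advantage on geolocation. Checkpoint ids and the image path are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, AutoModel, AutoImageProcessor

image = Image.open("street_scene.jpg")  # any street-level photo

# --- CLIP: place names are "baked in" via image-text pretraining ---------------
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = [f"a street photo taken in {c}" for c in ["France", "Japan", "Brazil"]]
inputs = clip_proc(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # Zero-shot country scores straight out of the box, no extra training.
    country_probs = clip(**inputs).logits_per_image.softmax(dim=-1)

# --- DINOv3-style backbone: vision only, no text, no place names ----------------
backbone_id = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # assumed checkpoint id
proc = AutoImageProcessor.from_pretrained(backbone_id)
backbone = AutoModel.from_pretrained(backbone_id)
with torch.no_grad():
    # CLS-token features; to geolocate, these must feed a separately trained
    # head (e.g. a regression or generative head over GPS coordinates).
    feats = backbone(**proc(images=image, return_tensors="pt")).last_hidden_state[:, 0]
```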
Cc @loicland.bsky.social @davidpicard.bsky.social @vickykalogeiton.bsky.social
How far can we go with ImageNet for Text-to-Image generation? w. @arrijitghosh.bsky.social @lucasdegeorge.bsky.social @nicolasdufour.bsky.social @vickykalogeiton.bsky.social
TL;DR: Train a text-to-image model with 1000x less data in 200 GPU hrs!
📜https://arxiv.org/abs/2502.21318
🧵👇
☑️With MAtCha, we leverage a pretrained depth model to recover sharp meshes from sparse views including both foreground and background, within mins!🧵
🌐Webpage: anttwo.github.io/matcha/
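For intuition only, here is a hedged sketch of the general idea of lifting a depth map (in practice predicted by a pretrained monocular depth model) into a triangle mesh by back-projecting pixels with camera intrinsics and triangulating the pixel grid. This is not MAtCha's actual method; the depth map and intrinsics below are synthetic stand-ins.

```python
# Illustrative sketch, NOT MAtCha's pipeline: turn a per-pixel depth map into a
# triangle mesh by lifting pixels to 3D and connecting neighbours into triangles.
import numpy as np

H, W = 120, 160                       # toy image resolution
fx = fy = 100.0                       # assumed focal lengths (made-up intrinsics)
cx, cy = W / 2, H / 2                 # principal point

# Stand-in depth map; a real pipeline would use a pretrained depth network here.
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
depth = 2.0 + 0.3 * np.sin(xs / 20.0)  # synthetic, smoothly varying depth (metres)

# Back-project every pixel (u, v, d) to a 3D point (X, Y, Z) in camera coordinates.
X = (xs - cx) * depth / fx
Y = (ys - cy) * depth / fy
vertices = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)   # (H*W, 3)

# Triangulate the pixel grid: each 2x2 block of pixels yields two triangles.
idx = np.arange(H * W).reshape(H, W)
tl, tr = idx[:-1, :-1], idx[:-1, 1:]
bl, br = idx[1:, :-1], idx[1:, 1:]
faces = np.concatenate([
    np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
    np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
])

print(vertices.shape, faces.shape)    # (19200, 3) (37842, 3)
```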
🗺️ Paper, code, and demo: nicolas-dufour.github.io/plonk