🔎Given a query image + an edit (“during night”), retrieve the same specific instance after the change — not just any similar object.
🛢New dataset on HF: i-CIR huggingface.co/datasets/bil...
🔥Download, run, and share results!
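If you want to grab it programmatically, here's a minimal sketch using huggingface_hub; the repo id is a placeholder, since only the truncated HF link appears above.

```python
# Minimal download sketch (repo id is a placeholder -- use the id from the HF link above).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<i-CIR dataset id>",  # placeholder, not the real id
    repo_type="dataset",
)
print("i-CIR files downloaded to:", local_dir)
```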
REGLUE (SiT-B/2) achieves 12.9 and 28.7 FID at 400K iterations in conditional and unconditional generation, respectively, outperforming REPA, ReDi, and REG. REGLUE (SiT-XL/2) matches 1M-step SOTA performance in just 700K iterations (~30% fewer steps).
Do compressed patch features retain VFM semantics?
Each point plots the semantics of frozen, compressed DINOv2 features (x: ImageNet top-1 / Cityscapes mIoU) against SiT-B generation quality (y: ImageNet FID) when trained on VAE latents + compressed features.
Linear PCA can limit patch-level semantics (e.g., ReDi). We introduce a lightweight non-linear semantic compressor that aggregates multi-layer VFM features into a compact, semantics-preserving space, boosting quality (21.4 → 13.3 FID).
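For intuition, a minimal PyTorch sketch of such a compressor; the layer count, dimensions, and activation here are illustrative assumptions, not the exact REGLUE module.

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Aggregate patch tokens from several VFM layers into a compact per-patch code."""
    def __init__(self, vfm_dim=768, num_layers=4, out_dim=32, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vfm_dim * num_layers, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, layer_feats):
        # layer_feats: list of [B, N_patches, vfm_dim] tensors from different VFM layers
        x = torch.cat(layer_feats, dim=-1)   # [B, N, vfm_dim * num_layers]
        return self.mlp(x)                   # [B, N, out_dim] compact patch semantics

# Example: patch features from 4 DINOv2-B layers, 256 tokens each
feats = [torch.randn(2, 256, 768) for _ in range(4)]
z_sem = SemanticCompressor()(feats)          # -> [2, 256, 32]
```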
REGLUE puts all of these into one unified model that jointly models:
1️⃣ VAE latents (pixels)
2️⃣ local semantics (compressed patch features)
3️⃣ global [CLS] (concept)
➕ alignment loss as a complementary auxiliary boost.
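A minimal sketch of what such a joint objective can look like (flow-matching over the concatenated streams plus an auxiliary alignment term); the loss weights, model interface, and exact formulation are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def joint_loss(model, x_latent, x_patch_sem, x_cls, vfm_feats, lam_align=0.5):
    # Stack the three streams into one target sequence (token dims assumed equal here).
    x1 = torch.cat([x_latent, x_patch_sem, x_cls], dim=1)   # [B, N_total, D]
    x0 = torch.randn_like(x1)                                # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)       # per-sample time
    xt = (1 - t) * x0 + t * x1                                # linear interpolation path
    target_v = x1 - x0                                        # flow-matching velocity target

    pred_v, hidden = model(xt, t)                             # model assumed to expose hidden states
    loss_fm = F.mse_loss(pred_v, target_v)

    # REPA-style alignment of hidden states with frozen VFM patch features (auxiliary).
    loss_align = 1 - F.cosine_similarity(hidden, vfm_feats, dim=-1).mean()
    return loss_fm + lam_align * loss_align
```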
Jointly modeling compressed patch-level semantics ➕ VAE latents provides spatial guidance and yields larger gains than alignment-only (REPA) or global-only (REG).
The alignment loss and the global [CLS] token remain complementary, orthogonal signals.
We introduce REGLUE: a unified framework that entangles VAE latents ➕ Global ➕ Local semantics for faster, higher-fidelity image generation.
Links (paper + code) at the end👇
Come by Poster Session 6, Fri 16:30, #4514 🧵
We present instance-level composed image retrieval, the new i-CIR dataset, and our training-free method BASIC.
Drop in and say hi!
⚡BASIC: training-free pipeline (centering, projection with PCA, textual contextualization, Harris-style fusion) with strong results across i-CIR and class-level CIR benchmarks.
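For a feel of the ingredients, a heavily simplified scorer with the same four steps; the concrete operations (and the Harris-style combination of the two similarities) are assumptions for illustration, not the exact BASIC formulation.

```python
import numpy as np

def score_database(q_img, q_txt, db_img, k=0.05, n_components=32):
    # q_img: [D] query image embedding, q_txt: [D] text embedding of the modification
    # (already contextualized, e.g. by prompting), db_img: [M, D] database image embeddings.
    # 1) Centering: remove the database mean so a common direction doesn't dominate.
    mu = db_img.mean(axis=0)
    db_c, qi_c, qt_c = db_img - mu, q_img - mu, q_txt - mu

    # 2) PCA projection: keep the leading principal directions of the database.
    _, _, Vt = np.linalg.svd(db_c, full_matrices=False)
    P = Vt[:n_components]                                  # [n_components, D]
    db_p, qi_p, qt_p = db_c @ P.T, qi_c @ P.T, qt_c @ P.T

    # 3) Cosine similarity of each database image to the visual and textual query parts.
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    s_img, s_txt = cos(qi_p, db_p), cos(qt_p, db_p)

    # 4) Harris-style fusion: reward images that score high on BOTH cues
    #    (product minus a penalty on the sum, mirroring det - k * trace^2).
    return s_img * s_txt - k * (s_img + s_txt) ** 2

# ranking = np.argsort(-score_database(q_img, q_txt, db_img))
```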
📊~750K images, 202 instances, ~1,900 composed queries. Despite small per-query DBs (~3.7K images), i-CIR matches the difficulty of searching with >40M random distractors.
🗂️ Per instance we share a database and define:
- composed positives (same object + modification)
- hard negatives:
  - visual (same/similar object, wrong text)
  - textual (right text, wrong instance)
  - composed (near-miss on both)
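A hypothetical sketch of what a per-instance record could look like; the field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ICIRInstance:
    instance_id: str                                              # e.g. "temple_of_poseidon"
    query_image: str                                              # query photo of the instance
    modification_text: str                                        # e.g. "during sunset"
    database: list[str] = field(default_factory=list)             # per-instance search pool
    composed_positives: list[str] = field(default_factory=list)   # same object + modification
    visual_negatives: list[str] = field(default_factory=list)     # same/similar object, wrong text
    textual_negatives: list[str] = field(default_factory=list)    # right text, wrong instance
    composed_negatives: list[str] = field(default_factory=list)   # near-miss on both
```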
🔎 Gap in the community: existing CIR benchmarks are class-level and ambiguous, lack explicit hard negatives, and often reward text-only behaviour. We needed a dataset that truly requires both image and text, at the instance level. i-CIR fills that gap.
🎨 Task: given (image of an object instance) + (text modification), retrieve photos of that exact instance under the change.
E.g.: Temple of Poseidon 🏛️ ➕ during sunset 🌅
📦 Project page: vrg.fel.cvut.cz/icir/
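To make the task concrete, a naive baseline for composing such a query (sum of normalized CLIP image and text embeddings); the model choice, file name, and fusion are illustrative assumptions, not the BASIC method.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

img = preprocess(Image.open("temple_of_poseidon.jpg")).unsqueeze(0)  # hypothetical file name
txt = tokenizer(["during sunset"])

with torch.no_grad():
    q_img = F.normalize(model.encode_image(img), dim=-1)
    q_txt = F.normalize(model.encode_text(txt), dim=-1)
    q = F.normalize(q_img + q_txt, dim=-1)                           # composed query embedding

# Rank database image embeddings by cosine similarity to q (higher = better match).
```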