REGLUE shows that the way we leverage VFM semantics matters for diffusion. Combining compact local semantics with global context yields faster convergence and state-of-the-art image generation.
📄arXiv: arxiv.org/abs/2512.16636
💻Project: reglueyourlatents.github.io
REGLUE (SiT-B/2) achieves 12.9 and 28.7 FID at 400K iterations in conditional and unconditional generation, respectively, outperforming REPA, ReDi, and REG. REGLUE (SiT-XL/2) matches 1M-step SOTA performance in just 700K iterations (~30% fewer steps).
External alignment complements joint modeling, but its benefits depend on the signal. Local alignment yields consistent gains, whereas global-only alignment can degrade performance. Spatial joint modeling remains the primary driver.
Our analysis shows that joint modeling with patch-level semantics drives most of the gains. The global [CLS] helps, but fine-grained spatial features deliver a substantially larger FID improvement, highlighting the importance of local structure for diffusion.
Do compressed patch features retain VFM semantics?
Points show frozen compressed DINOv2 semantics (x: ImageNet top-1 / Cityscapes mIoU) vs SiT-B generation quality (y: ImageNet FID) when trained on VAE latents + compressed features.
Linear PCA can limit patch-level semantics (e.g., ReDi). We introduce a lightweight non-linear semantic compressor that aggregates multi-layer VFM features into a compact, semantics-preserving space, boosting quality (21.4 → 13.3 FID).
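A minimal sketch of what such a non-linear compressor could look like: a plain two-layer MLP over concatenated multi-layer patch features. All dimensions, the layer choice, and the architecture here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_compressor(in_dims, hidden, out_dim):
    """Parameters of a two-layer MLP mapping concatenated multi-layer
    VFM patch features to a compact semantic code (toy sketch)."""
    d_in = sum(in_dims)
    return {
        "W1": rng.standard_normal((d_in, hidden)) * (2.0 / d_in) ** 0.5,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * (2.0 / hidden) ** 0.5,
        "b2": np.zeros(out_dim),
    }

def compress(p, layer_feats):
    # layer_feats: list of (num_patches, dim) arrays from several VFM layers
    x = np.concatenate(layer_feats, axis=-1)    # aggregate layers
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)  # ReLU non-linearity
    return h @ p["W2"] + p["b2"]                # compact semantic code

# Toy example: 256 patches, three DINOv2-like 768-d layers -> 32-d codes
params = init_compressor(in_dims=[768, 768, 768], hidden=512, out_dim=32)
feats = [rng.standard_normal((256, 768)) for _ in range(3)]
codes = compress(params, feats)
print(codes.shape)  # (256, 32)
```

The non-linearity is the point: unlike linear PCA, the MLP can mix and re-weight information across layers before compressing.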
REGLUE puts these into one unified model and jointly models:
1️⃣ VAE latents (pixels)
2️⃣ local semantics (compressed patch features)
3️⃣ global [CLS] (concept)
➕ alignment loss as a complementary auxiliary boost.
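As a hedged sketch of how the three streams plus the auxiliary alignment term could combine into one training objective (the function name, loss weights, and shapes are assumptions for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cos_sim(a, b):
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return float(np.mean(num / den))

def reglue_style_loss(pred_latent, tgt_latent, pred_patch, tgt_patch,
                      pred_cls, tgt_cls, model_feat, vfm_feat, w_align=0.5):
    loss = mse(pred_latent, tgt_latent)   # 1) VAE latents (pixels)
    loss += mse(pred_patch, tgt_patch)    # 2) local semantics (patch codes)
    loss += mse(pred_cls, tgt_cls)        # 3) global [CLS] (concept)
    loss += w_align * (1.0 - cos_sim(model_feat, vfm_feat))  # auxiliary alignment
    return loss

# Sanity check: identical predictions and targets give (near-)zero loss
z = rng.standard_normal((4, 32))
p = rng.standard_normal((4, 256, 16))
c = rng.standard_normal((4, 384))
f = rng.standard_normal((4, 256, 384))
loss_zero = reglue_style_loss(z, z, p, p, c, c, f, f)
print(loss_zero < 1e-6)  # True
```

The structural idea from the thread is what the sketch encodes: the three streams are denoised jointly, while alignment is only an additive auxiliary term, not the main driver.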
Jointly modeling compressed patch-level semantics ➕ VAE latents provides spatial guidance and yields larger gains than alignment-only (REPA) or global-only (REG).
The alignment loss and the global [CLS] token remain complementary, orthogonal signals.
To leverage VFMs effectively, diffusion should jointly model VAE latents with multi-layer VFM spatial (patch-level) semantics, via a compact, non-linearly compressed representation.
Existing joint modeling and external alignment approaches (e.g., REPA, REG) inject only a “narrow slice” of VFM features into diffusion. We argue richer semantics are needed to unlock their full potential.
📄 arxiv.org/abs/2510.25387
🧪 github.com/billpsomas/i...
George Retsinas, @nikos-efth.bsky.social, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, @gtolias.bsky.social.
⚡BASIC: training-free pipeline (centering, projection with PCA, textual contextualization, Harris-style fusion) with strong results across i-CIR and class-level CIR benchmarks.
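A toy, training-free sketch of the four stages named above, on random CLIP-like embeddings. The contextualization formula, the PCA rank, and the Harris constant are all assumptions (chosen by analogy to the Harris detector response), not BASIC's actual recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for CLIP-style embeddings (dimensions and data assumed)
db = rng.standard_normal((1000, 64))   # per-query database
q_img = rng.standard_normal(64)        # visual query (the instance)
q_txt = rng.standard_normal(64)        # textual modification

def l2n(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

# 1) Centering: subtract the database mean to remove embedding bias
mu = db.mean(0)
db_c, qi_c, qt_c = db - mu, q_img - mu, q_txt - mu

# 2) Projection with PCA: keep the top-k directions of the database
_, _, vt = np.linalg.svd(db_c, full_matrices=False)
P = vt[:32].T
db_p, qi_p, qt_p = db_c @ P, qi_c @ P, qt_c @ P

# 3) Textual contextualization (hypothetical form): bias the visual
#    query toward the text before scoring
q_ctx = l2n(l2n(qi_p) + l2n(qt_p))

# 4) Harris-style fusion: the product rewards agreement of the two
#    similarities; the squared-sum penalty discourages relying on a
#    single modality alone
s_img = l2n(db_p) @ q_ctx
s_txt = l2n(db_p) @ l2n(qt_p)
fused = s_img * s_txt - 0.05 * (s_img + s_txt) ** 2
ranking = np.argsort(-fused)
print(ranking[:5])
```

Everything here runs without any training, which is the property the post emphasizes; only frozen embeddings and linear algebra are involved.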
📊~750K images, 202 instances, ~1,900 composed queries. Despite small per-query DBs (~3.7K images), i-CIR matches the difficulty of searching with >40M random distractors.