Thodoris Kouzelis
@nicolabourbaki.bsky.social
35 followers 38 following 21 posts
1st year PhD Candidate Archimedes, Athena RC & NTUA
10/n We apply PCA to the DINOv2 features so they stay expressive without dominating model capacity. Just a few PCs suffice to significantly boost generative performance.
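As a rough illustration of that step, here is how one might PCA-reduce DINOv2 patch features with plain PyTorch; the random feature tensor, the backbone name in the comment, and the choice of 8 components are stand-ins, not the paper's exact setup.

```python
import torch

# Stand-in for DINOv2 patch features of shape (num_patches, feat_dim);
# in practice these would come from a frozen DINOv2 backbone (e.g. dinov2_vitb14).
feats = torch.randn(10_000, 768)

# Fit PCA on a large sample of training features and keep only a few PCs.
mean = feats.mean(dim=0, keepdim=True)
U, S, V = torch.pca_lowrank(feats, q=8)   # 8 retained PCs is an assumed value
feats_pca = (feats - mean) @ V            # (num_patches, 8): the low-dim code that gets diffused
feats_back = feats_pca @ V.T + mean       # approximate map back to the DINOv2 feature space
```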
9/n Unconditional generation gets a huge upgrade too. ReDi + Representation Guidance (RG) nearly closes the gap with conditional models. E.g., unconditional DiT-XL/2 with ReDi+RG hits FID 22.6, close to class-conditioned DiT-XL’s FID 19.5! 💪
8/n ReDi delivers state-of-the-art generation performance across the board. 🔥
7/n Training speed? Massive improvements for both DiT and SiT:
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA.🚀
6/n ReDi requires no extra distillation losses, just pure diffusion, significantly simplifying training. Plus, it unlocks Representation Guidance (RG), a new inference strategy that uses learned semantics to steer and refine image generation. 🎯
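Purely as an illustration of the idea, here is what a guidance rule in the spirit of classifier-free guidance could look like for the semantic stream, extrapolating the image prediction toward the representation-informed one; the function, its inputs, and the formula are my assumptions, not ReDi's actual RG definition.

```python
def guided_prediction(pred_with_rep, pred_without_rep, w=1.5):
    """CFG-style extrapolation toward the representation-informed prediction.

    pred_with_rep:    model output for the image stream when the semantic
                      (representation) tokens are provided.
    pred_without_rep: output when they are dropped or noised out.
    w: guidance weight (assumed); w=1 recovers the representation-informed prediction.
    """
    return pred_without_rep + w * (pred_with_rep - pred_without_rep)
```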
5/n We explore two ways to fuse the image-latent and semantic-feature tokens:
- Merged Tokens (MR): Efficient, keeps token count constant
- Separate Tokens (SP): More expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
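A shape-level sketch of the two options, assuming patch-aligned latent and feature tokens; the projection layers and dimensions below are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

B, N = 2, 256                  # batch size and number of patch tokens (assumed)
d_img, d_feat, d = 4, 8, 384   # VAE channels, retained PCs, transformer width (assumed)

img_tokens = torch.randn(B, N, d_img)     # noisy VAE-latent patches
feat_tokens = torch.randn(B, N, d_feat)   # noisy semantic-feature patches

# Merged Tokens (MR): fuse channel-wise per position, so the token count stays N.
merge_proj = nn.Linear(d_img + d_feat, d)
merged = merge_proj(torch.cat([img_tokens, feat_tokens], dim=-1))   # (B, N, d)

# Separate Tokens (SP): embed each stream, then concatenate along the sequence,
# doubling the token count (hence roughly 2x attention compute).
img_proj, feat_proj = nn.Linear(d_img, d), nn.Linear(d_feat, d)
separate = torch.cat([img_proj(img_tokens), feat_proj(feat_tokens)], dim=1)  # (B, 2N, d)
```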
4/n Integrating ReDi into DiT/SiT-style architectures is seamless:
- Apply noise to both image latents and semantic features
- Fuse them into one token sequence
- Denoise both with standard DiT/SiT
That’s it.
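Sketched as a single training step, under the assumption of a flow-matching-style formulation, token-shaped inputs, and a backbone that returns one prediction per stream; the schedule, targets, and `model` interface are placeholders, not ReDi's released code.

```python
import torch

def redi_training_step(model, optimizer, x_latent, x_feat):
    """One joint denoising step on (VAE latent, semantic feature) pairs.

    Assumes both inputs are token tensors of shape (B, N, C) and that `model`
    returns one prediction per stream (assumed interface).
    """
    # 1. Apply the same diffusion corruption to both modalities (linear schedule assumed).
    t = torch.rand(x_latent.shape[0], device=x_latent.device)
    a = t.view(-1, 1, 1)
    n_l, n_f = torch.randn_like(x_latent), torch.randn_like(x_feat)
    z_latent = (1 - a) * x_latent + a * n_l
    z_feat = (1 - a) * x_feat + a * n_f

    # 2. Fuse into one token sequence and 3. denoise both with the standard backbone.
    pred_l, pred_f = model(z_latent, z_feat, t)

    # Plain flow-matching/velocity loss on both streams; no extra distillation terms.
    loss = ((pred_l - (n_l - x_latent)) ** 2).mean() + ((pred_f - (n_f - x_feat)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```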
3/n ReDi builds on the insight that some latent representations are inherently easier to model (h/t @sedielem.bsky.social's blog), enabling a unified dual-space diffusion approach that generates coherent image–feature pairs from pure noise.
2/n The result?
🔗 A powerful new method for generative image modeling that bridges generation and representation learning.
⚡️Brings massive gains in performance/training efficiency and a new paradigm for representation-aware generative modeling.
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
– Low-level image details (via VAE latents)
– High-level semantic features (via DINOv2)🧵
9/n How fast does EQ-VAE refine the latents?
We trained a DiT-B/2 on the latents produced after each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.
8/n Why does EQ-VAE help so much?
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold.
🔹 This makes the latent space simpler and easier to model.
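For intuition, here is a minimal sketch of one standard intrinsic-dimension estimator (TwoNN, Facco et al. 2017) applied to flattened latent vectors; I am not claiming this is the estimator used in the paper, and the `z_base` / `z_eq` names in the usage comment are hypothetical.

```python
import torch

def twonn_intrinsic_dim(z: torch.Tensor) -> float:
    """TwoNN estimate of intrinsic dimension for latent vectors z of shape (N, D)."""
    d = torch.cdist(z, z)                                 # pairwise distances
    d.fill_diagonal_(float("inf"))                        # ignore self-distances
    r = torch.topk(d, k=2, dim=1, largest=False).values   # 1st and 2nd nearest neighbours
    mu = r[:, 1] / r[:, 0]                                # distance ratios
    return (z.shape[0] / torch.log(mu).sum()).item()      # maximum-likelihood estimate

# e.g. compare flattened latents from the base VAE and from the EQ-VAE fine-tune:
# id_base, id_eq = twonn_intrinsic_dim(z_base), twonn_intrinsic_dim(z_eq)
```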
7/n Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: Training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)
6/n EQ-VAE is a plug-and-play enhancement that requires no architectural changes and works seamlessly with:
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
5/n EQ-VAE fixes this by introducing a simple regularization objective:
👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs.
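A minimal sketch of that objective as described: decode a spatially transformed latent and penalize the difference from the correspondingly transformed input. The specific transform (a horizontal flip here), the L1 loss, and the `encoder` / `decoder` interface are stand-ins for whatever the method actually samples and weights.

```python
import torch
import torch.nn.functional as F

def eq_vae_reg_loss(encoder, decoder, x):
    """Equivariance regularizer for an autoencoder (assumed encoder/decoder API).

    x: input images of shape (B, C, H, W). A horizontal flip stands in for the
    spatial transformations (scaling, rotation, ...) sampled during training.
    """
    z = encoder(x)                       # latent, e.g. (B, c, h, w)
    z_t = torch.flip(z, dims=[-1])       # transform applied directly in latent space
    x_t = torch.flip(x, dims=[-1])       # same transform applied to the input
    return F.l1_loss(decoder(z_t), x_t)  # reconstruction of T(z) should match T(x)
```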
4/n The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
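That observation can be probed directly; below is a rough sketch using bilinear scaling, with `vae.encode` / `vae.decode` as placeholder calls for whatever autoencoder interface is available.

```python
import torch.nn.functional as F

def equivariance_probe(vae, x, scale=0.5):
    """Return (a) error of reconstructing a downscaled image and
    (b) error of decoding a downscaled latent; (b) >> (a) for standard VAEs."""
    x_s = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

    rec_img_scaled = vae.decode(vae.encode(x_s))   # scale the image, then encode/decode
    z_s = F.interpolate(vae.encode(x), scale_factor=scale,
                        mode="bilinear", align_corners=False)
    rec_lat_scaled = vae.decode(z_s)               # scale the latent, then decode

    err_a = (rec_img_scaled - x_s).abs().mean().item()
    err_b = (rec_lat_scaled - x_s).abs().mean().item()
    return err_a, err_b
```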
3/n Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
2/n Why EQ-VAE?
🔹Smoother latent space = easier to model & better generative performance.
🔹No trade-off in reconstruction quality—rFID improves too!
🔹Works as a plug-and-play enhancement—no architectural changes needed!
1/n🚀If you’re working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.👇