Thodoris Kouzelis
@nicolabourbaki.bsky.social
35 followers 38 following 21 posts
1st year PhD Candidate Archimedes, Athena RC & NTUA
10/n We apply PCA to the DINOv2 features so they stay expressive without dominating model capacity. Just a few PCs suffice to significantly boost generative performance.
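As a rough illustration of that step, here is how one might PCA-reduce DINOv2 patch features with plain PyTorch; the random feature tensor, the backbone name in the comment, and the choice of 8 components are stand-ins, not the paper's exact setup.

```python
import torch

# Stand-in for DINOv2 patch features of shape (num_patches, feat_dim);
# in practice these would come from a frozen DINOv2 backbone (e.g. dinov2_vitb14).
feats = torch.randn(10_000, 768)

# Fit PCA on a large sample of training features and keep only a few PCs.
mean = feats.mean(dim=0, keepdim=True)
U, S, V = torch.pca_lowrank(feats, q=8)   # 8 retained PCs is an assumed value
feats_pca = (feats - mean) @ V            # (num_patches, 8): the low-dim code that gets diffused
feats_back = feats_pca @ V.T + mean       # approximate map back to the DINOv2 feature space
```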
9/n Unconditional generation gets a huge upgrade too. ReDi + Representation Guidance (RG) nearly closes the gap with conditional models. E.g., unconditional DiT-XL/2 with ReDi+RG hits FID 22.6, close to class-conditioned DiT-XL’s FID 19.5! 💪
8/n ReDi delivers state-of-the-art generation performance across the board. 🔥
7/n Training speed? Massive improvements for both DiT and SiT:
~23x faster convergence than baseline DiT/SiT.
~6x faster than REPA.🚀
6/n ReDi requires no extra distillation losses, just pure diffusion, significantly simplifying training. Plus, it unlocks Representation Guidance (RG), a new inference strategy that uses learned semantics to steer and refine image generation. 🎯
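Purely as an illustration of the idea, here is what a guidance rule in the spirit of classifier-free guidance could look like for the semantic stream, extrapolating the image prediction toward the representation-informed one; the function, its inputs, and the formula are my assumptions, not ReDi's actual RG definition.

```python
def guided_prediction(pred_with_rep, pred_without_rep, w=1.5):
    """CFG-style extrapolation toward the representation-informed prediction.

    pred_with_rep:    model output for the image stream when the semantic
                      (representation) tokens are provided.
    pred_without_rep: output when they are dropped or noised out.
    w: guidance weight (assumed); w=1 recovers the representation-informed prediction.
    """
    return pred_without_rep + w * (pred_with_rep - pred_without_rep)
```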
5/n We explore two ways to fuse the image-latent and semantic-feature tokens:
- Merged Tokens (MR): Efficient, keeps token count constant
- Separate Tokens (SP): More expressive, ~2x compute
Both boost performance, but MR hits the sweet spot for speed vs. quality.
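A shape-level sketch of the two options, assuming patch-aligned latent and feature tokens; the projection layers and dimensions below are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

B, N = 2, 256                  # batch size and number of patch tokens (assumed)
d_img, d_feat, d = 4, 8, 384   # VAE channels, retained PCs, transformer width (assumed)

img_tokens = torch.randn(B, N, d_img)     # noisy VAE-latent patches
feat_tokens = torch.randn(B, N, d_feat)   # noisy semantic-feature patches

# Merged Tokens (MR): fuse channel-wise per position, so the token count stays N.
merge_proj = nn.Linear(d_img + d_feat, d)
merged = merge_proj(torch.cat([img_tokens, feat_tokens], dim=-1))   # (B, N, d)

# Separate Tokens (SP): embed each stream, then concatenate along the sequence,
# doubling the token count (hence roughly 2x attention compute).
img_proj, feat_proj = nn.Linear(d_img, d), nn.Linear(d_feat, d)
separate = torch.cat([img_proj(img_tokens), feat_proj(feat_tokens)], dim=1)  # (B, 2N, d)
```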
4/n Integrating ReDi into DiT/SiT-style architectures is seamless:
- Apply noise to both image latents and semantic features
- Fuse them into one token sequence
- Denoise both with standard DiT/SiT
That’s it.
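Sketched as a single training step, under the assumption of a flow-matching-style formulation, token-shaped inputs, and a backbone that returns one prediction per stream; the schedule, targets, and `model` interface are placeholders, not ReDi's released code.

```python
import torch

def redi_training_step(model, optimizer, x_latent, x_feat):
    """One joint denoising step on (VAE latent, semantic feature) pairs.

    Assumes both inputs are token tensors of shape (B, N, C) and that `model`
    returns one prediction per stream (assumed interface).
    """
    # 1. Apply the same diffusion corruption to both modalities (linear schedule assumed).
    t = torch.rand(x_latent.shape[0], device=x_latent.device)
    a = t.view(-1, 1, 1)
    n_l, n_f = torch.randn_like(x_latent), torch.randn_like(x_feat)
    z_latent = (1 - a) * x_latent + a * n_l
    z_feat = (1 - a) * x_feat + a * n_f

    # 2. Fuse into one token sequence and 3. denoise both with the standard backbone.
    pred_l, pred_f = model(z_latent, z_feat, t)

    # Plain flow-matching/velocity loss on both streams; no extra distillation terms.
    loss = ((pred_l - (n_l - x_latent)) ** 2).mean() + ((pred_f - (n_f - x_feat)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```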
3/n ReDi builds on the insight that some latent representations are inherently easier to model (h/t @sedielem.bsky.social's blog), enabling a unified dual-space diffusion approach that generates coherent image–feature pairs from pure noise.
2/n The result?
🔗 A powerful new method for generative image modeling that bridges generation and representation learning.
⚡️Brings massive gains in performance/training efficiency and a new paradigm for representation-aware generative modeling.
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
– Low-level image details (via VAE latents)
– High-level semantic features (via DINOv2)🧵
9/n How fast does EQ-VAE refine the latents?
We trained a DiT-B/2 on the latents produced after each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.
8/n Why does EQ-VAE help so much?
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold.
🔹 This makes the latent space simpler and easier to model.
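For intuition, here is a minimal sketch of one standard intrinsic-dimension estimator (TwoNN, Facco et al. 2017) applied to flattened latent vectors; I am not claiming this is the estimator used in the paper, and the `z_base` / `z_eq` names in the usage comment are hypothetical.

```python
import torch

def twonn_intrinsic_dim(z: torch.Tensor) -> float:
    """TwoNN estimate of intrinsic dimension for latent vectors z of shape (N, D)."""
    d = torch.cdist(z, z)                                 # pairwise distances
    d.fill_diagonal_(float("inf"))                        # ignore self-distances
    r = torch.topk(d, k=2, dim=1, largest=False).values   # 1st and 2nd nearest neighbours
    mu = r[:, 1] / r[:, 0]                                # distance ratios
    return (z.shape[0] / torch.log(mu).sum()).item()      # maximum-likelihood estimate

# e.g. compare flattened latents from the base VAE and from the EQ-VAE fine-tune:
# id_base, id_eq = twonn_intrinsic_dim(z_base), twonn_intrinsic_dim(z_eq)
```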
7/n Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: Training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)
6/n EQ-VAE is a plug-and-play enhancement that requires no architectural changes and works seamlessly with:
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
5/n EQ-VAE fixes this by introducing a simple regularization objective:
👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs.
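A minimal sketch of that objective as described: decode a spatially transformed latent and penalize the difference from the correspondingly transformed input. The specific transform (a horizontal flip here), the L1 loss, and the `encoder` / `decoder` interface are stand-ins for whatever the method actually samples and weights.

```python
import torch
import torch.nn.functional as F

def eq_vae_reg_loss(encoder, decoder, x):
    """Equivariance regularizer for an autoencoder (assumed encoder/decoder API).

    x: input images of shape (B, C, H, W). A horizontal flip stands in for the
    spatial transformations (scaling, rotation, ...) sampled during training.
    """
    z = encoder(x)                       # latent, e.g. (B, c, h, w)
    z_t = torch.flip(z, dims=[-1])       # transform applied directly in latent space
    x_t = torch.flip(x, dims=[-1])       # same transform applied to the input
    return F.l1_loss(decoder(z_t), x_t)  # reconstruction of T(z) should match T(x)
```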
4/n The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
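That observation can be probed directly; below is a rough sketch using bilinear scaling, with `vae.encode` / `vae.decode` as placeholder calls for whatever autoencoder interface is available.

```python
import torch.nn.functional as F

def equivariance_probe(vae, x, scale=0.5):
    """Return (a) error of reconstructing a downscaled image and
    (b) error of decoding a downscaled latent; (b) >> (a) for standard VAEs."""
    x_s = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

    rec_img_scaled = vae.decode(vae.encode(x_s))   # scale the image, then encode/decode
    z_s = F.interpolate(vae.encode(x), scale_factor=scale,
                        mode="bilinear", align_corners=False)
    rec_lat_scaled = vae.decode(z_s)               # scale the latent, then decode

    err_a = (rec_img_scaled - x_s).abs().mean().item()
    err_b = (rec_lat_scaled - x_s).abs().mean().item()
    return err_a, err_b
```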
3/n Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
2/n Why EQ-VAE?
🔹Smoother latent space = easier to model & better generative performance.
🔹No trade-off in reconstruction quality—rFID improves too!
🔹Works as a plug-and-play enhancement—no architectural changes needed!
1/n🚀If you’re working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.👇