Xi WANG
@xiwang92.bsky.social
Ecole Polytechnique, IP Paris; Prev. Ph.D.@Univ Rennes, Inria/IRISA https://triocrossing.github.io/
Reposted by Xi WANG
chriswolfvision.bsky.social
CVPR@Paris opening speech at Sorbonne University by @davidpicard.bsky.social, @vickykalogeiton.bsky.social, and Matthieu Cord.

Great location!

❤️

(also: free food, as at the 'real' CVPR)
xiwang92.bsky.social
We test Di[M]O on image generation with MaskGIT & Meissonic as teacher models.
- The first one-step MDM that competes with multi-step teachers.
- A significant speed-up of 8 to 32 times without degradation in quality.
- The first successful distillation approach for text-to-image MDMs.
xiwang92.bsky.social
Our approach fundamentally differs from previous distillation methods such as DMD. Instead of minimizing the divergence between denoising distributions over the entire latent space, Di[M]O minimizes the divergence between token-level conditional distributions.
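For concreteness, here is a minimal sketch (PyTorch-style Python, not the authors' code) of what a token-level divergence can look like: a KL between teacher and student conditional distributions computed per masked token position rather than over the whole sequence. All tensor names and shapes are illustrative assumptions.

```python
# Minimal sketch, assuming logits of shape [batch, seq_len, vocab] and a
# boolean mask marking still-masked positions. Not the authors' exact loss.
import torch
import torch.nn.functional as F

def token_level_kl(teacher_logits, student_logits, mask):
    """KL(teacher || student) per token, averaged over masked positions."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # Divergence over the vocabulary dimension, one value per token position.
    kl = (p_teacher * (p_teacher.clamp_min(1e-12).log() - log_p_student)).sum(-1)
    # Average only over the masked positions.
    return (kl * mask).sum() / mask.sum().clamp_min(1)
```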
xiwang92.bsky.social
To approximate the loss gradient, we introduce an auxiliary model that estimates an otherwise intractable term in the loss function. The auxiliary model is trained using a standard MDM training loss, with one-step generated samples as targets.
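A hedged sketch of what such an auxiliary update could look like, assuming the auxiliary network is trained with a standard masked-prediction (MDM) loss whose targets are the student's one-step samples; the names (student, aux_model, mask_id, ...) are illustrative, not the paper's exact training loop.

```python
# Sketch of one auxiliary-model update: generate one-step samples from the
# student, re-mask them at a random ratio, and train the auxiliary model to
# predict the masked tokens. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def auxiliary_update(student, aux_model, aux_opt, masked_tokens, mask_id):
    with torch.no_grad():
        # One-step generation: the student predicts every masked token at once.
        targets = student(masked_tokens).argmax(dim=-1)           # [B, L]
    # Re-mask the generated sequence at a random ratio (standard MDM training).
    ratio = torch.rand(targets.size(0), 1, device=targets.device)
    remask = torch.rand_like(targets, dtype=torch.float) < ratio  # [B, L] bool
    inputs = torch.where(remask, torch.full_like(targets, mask_id), targets)
    logits = aux_model(inputs)                                     # [B, L, V]
    # Cross-entropy only on re-masked positions; targets are one-step samples.
    loss = F.cross_entropy(logits[remask], targets[remask])
    aux_opt.zero_grad(); loss.backward(); aux_opt.step()
    return loss.item()
```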
xiwang92.bsky.social
To sample from the correct joint distribution, we introduce an initialization that maps a randomized input sequence to an almost deterministic target sequence.
Without proper initialization, the model may suffer from divergence or mode collapse, making this step essential.
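One possible reading of this initialization, as a rough sketch only: briefly warm up the one-step generator so that randomized input sequences map to a nearly deterministic target (here taken from a teacher decoding). The helper teacher_decode and all other names are assumptions for illustration, not the authors' exact procedure.

```python
# Warm-up sketch: random token sequences act as the noise source, and the
# student is fitted to an almost-deterministic target sequence per input.
import torch
import torch.nn.functional as F

def init_warmup_step(student, student_opt, teacher_decode,
                     vocab_size, batch, seq_len, device):
    # Randomized input sequence of i.i.d. random tokens.
    rand_inputs = torch.randint(vocab_size, (batch, seq_len), device=device)
    with torch.no_grad():
        # Almost-deterministic target, e.g. a greedy teacher decoding (assumed helper).
        targets = teacher_decode(rand_inputs)                    # [B, L] token ids
    logits = student(rand_inputs)                                # [B, L, V]
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    student_opt.zero_grad(); loss.backward(); student_opt.step()
    return loss.item()
```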
xiwang92.bsky.social
The key idea is inspired by on-policy distillation. We align the output distributions of the teacher and student models at student-generated intermediate states, ensuring that the student's generation closely matches the teacher's by covering all possible intermediate states.
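A simplified sketch of such an on-policy alignment step, under the assumption that intermediate states are obtained by re-masking the student's own one-step samples; it deliberately ignores how gradients flow through the sampling itself, which is the intractable part the auxiliary model discussed elsewhere in this thread is meant to handle. All names are illustrative.

```python
# On-policy alignment sketch: build intermediate states from the student's own
# samples, then match teacher and student conditionals at those states with a
# token-level KL. Not the authors' exact objective or gradient estimator.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, student_opt, masked_tokens, mask_id):
    with torch.no_grad():
        # One-step student sample from a (partially) masked input.
        sample = student(masked_tokens).argmax(dim=-1)              # [B, L]
    # Re-mask the sample at a random ratio to form an intermediate state, so
    # training visits the intermediate states the student itself induces.
    ratio = torch.rand(sample.size(0), 1, device=sample.device)
    remask = torch.rand_like(sample, dtype=torch.float) < ratio
    intermediate = torch.where(remask, torch.full_like(sample, mask_id), sample)
    with torch.no_grad():
        p_teacher = F.softmax(teacher(intermediate), dim=-1)        # [B, L, V]
    log_p_student = F.log_softmax(student(intermediate), dim=-1)
    # Token-level KL(teacher || student), averaged over re-masked positions.
    kl = (p_teacher * (p_teacher.clamp_min(1e-12).log() - log_p_student)).sum(-1)
    loss = (kl * remask).sum() / remask.sum().clamp_min(1)
    student_opt.zero_grad(); loss.backward(); student_opt.step()
    return loss.item()
```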
xiwang92.bsky.social
Masked Diffusion Models (MDMs) are a hot topic in generative AI 🔥 — powerful but slow due to multiple sampling steps.
We @polytechniqueparis.bsky.social and @inria-grenoble.bsky.social introduce Di[M]O — a novel approach to distill MDMs into a one-step generator without sacrificing quality.