Interested in extracting world understanding from models and more controlled generation. 🌐 https://stefan-baumann.eu/
Let's make everything generative! No reason to forgo an (at least implicit) distribution for every prediction we make, if we can make it at least as accurate and similarly efficient as discriminative baselines in the long run.
We'll also be in Honolulu to present the paper at #ICCV2025 next week 🌺.
Take a look now!
🌐 Project Page: compvis.github.io/flow-poke-tr...
📝 Paper: arxiv.org/abs/2510.12777
💻 Code & Weights: github.com/CompVis/flow...
@rmsnorm.bsky.social, @timyphan.bsky.social, and Björn Ommer at @compvis.bsky.social. A giant thank you to them! ❤️
• Articulated motion (Drag-A-Move): fine-tuned FPT outperforms specialized models for motion prediction
• Face motion: zero-shot, beats specialized baselines
• Moving part segmentation: emerges from formulation
FPT 𝘳𝘦𝘱𝘳𝘦𝘴𝘦𝘯𝘵𝘴 𝘵𝘩𝘦 𝘧𝘶𝘭𝘭 𝘮𝘰𝘵𝘪𝘰𝘯 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯, enabling:
• interpretable uncertainty
• controllable interaction effects
• efficient prediction (>100k predictions/s)
Predict 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀 of motion, not just one flow field instance.
Given a few pokes, our model outputs the probability 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of how parts of the scene might move.
→ This directly captures 𝘶𝘯𝘤𝘦𝘳𝘵𝘢𝘪𝘯𝘵𝘺 and interactions.
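To make the "distribution vs. one flow field" point concrete, here is a toy sketch. It assumes a two-component isotropic Gaussian mixture over the 2D flow at a single query point; that parameterization, the function name `sample_flow`, and all numbers are illustrative assumptions, not taken from the paper or the FPT code.

```python
import random

def sample_flow(weights, means, stds, n, seed=0):
    """Draw n 2D flow vectors from an isotropic Gaussian mixture (hypothetical toy model)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick a mixture component according to its weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        mx, my = means[k]
        # Sample the flow vector around that component's mean.
        samples.append((rng.gauss(mx, stds[k]), rng.gauss(my, stds[k])))
    return samples

# Assumed scenario: after a poke, a part moves 5 px left or 5 px right with equal probability.
weights = [0.5, 0.5]
means = [(-5.0, 0.0), (5.0, 0.0)]
stds = [0.5, 0.5]

samples = sample_flow(weights, means, stds, n=10_000)
mean_dx = sum(dx for dx, _ in samples) / len(samples)

# The average motion is close to zero, yet no individual sample is close to zero:
# a single averaged prediction would describe a motion that essentially never occurs.
print(f"mean dx: {mean_dx:.2f}")
print(f"smallest |dx| among samples: {min(abs(dx) for dx, _ in samples):.2f}")
```

In this toy setup, the mixture keeps both outcomes, while their average collapses to "no motion".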
But most models predict 𝗼𝗻𝗲 𝗳𝘂𝘁𝘂𝗿𝗲, a single deterministic motion.
The reality is 𝘶𝘯𝘤𝘦𝘳𝘵𝘢𝘪𝘯 and 𝘮𝘶𝘭𝘵𝘪-𝘮𝘰𝘥𝘢𝘭: one poke can lead to many outcomes.
The model has never been trained for it, and, while it's obvious that related capabilities might be useful for denoising, I'd still consider this an emergent capability