Thomas Wimmer
@wimmerthomas.bsky.social
400 followers 140 following 24 posts
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision. https://wimmerth.github.io
wimmerthomas.bsky.social
Happy to find my name on the list of outstanding reviewers :]

Come and check out our poster on learning better features for semantic correspondence in Hawaii!

📍 Poster #538 (Session 2)
🗓️ Oct 21 | 3:15 – 5:00 p.m. HST

genintel.github.io/DIY-SC
iccv.bsky.social
There’s no conference without the efforts of our reviewers. Special shoutout to our #ICCV2025 outstanding reviewers 🫡

iccv.thecvf.com/Conferences/...
2025 ICCV Program Committee
iccv.thecvf.com
wimmerthomas.bsky.social
What was the patch size used here?
wimmerthomas.bsky.social
All the links can be found here. Great collaborators!

bsky.app/profile/odue...
oduenkel.bsky.social
🔗Project page: genintel.github.io/DIY-SC
📄Paper: arxiv.org/pdf/2506.05312
💻Code: github.com/odunkel/DIY-SC
🤗Demo: huggingface.co/spaces/odunk...

Great collaboration with @wimmerthomas.bsky.social, Christian Theobalt, Christian Rupprecht, and @adamkortylewski.bsky.social! [6/6]
wimmerthomas.bsky.social
🚀 Just accepted to ICCV 2025!

In DIY-SC, we improve foundation model features using a lightweight adapter trained with carefully filtered and refined pseudo-labels.

🔧 Drop-in alternative to plain DINOv2 features!
📦 Code + pre-trained weights available now.
🔥 Try it in your next vision project!
oduenkel.bsky.social
Are you using DINOv2 for tasks that require semantic features? DIY-SC might be the alternative!
It refines DINOv2 or SD+DINOv2 features and achieves a new SOTA on the semantic correspondence dataset SPair-71k among methods that do not rely on annotated keypoints! [1/6]
genintel.github.io/DIY-SC
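To illustrate the "drop-in" idea, here is a minimal sketch of refining frozen DINOv2 patch features with a small residual MLP adapter. The adapter architecture, dimensions, and dummy input are illustrative assumptions, not the released DIY-SC code; see the repository above for the actual implementation and weights.

```python
# Hedged sketch: refining DINOv2 patch features with a lightweight adapter,
# in the spirit of DIY-SC. The adapter below is an illustrative placeholder.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Hypothetical lightweight MLP adapter on top of frozen DINOv2 features."""
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the adapter a drop-in replacement.
        return feats + self.mlp(feats)

# Frozen DINOv2 backbone (ViT-B/14) from torch hub.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
adapter = FeatureAdapter(dim=768)  # in practice, loaded from released weights

img = torch.randn(1, 3, 518, 518)  # dummy image, 518 = 37 * 14 patches
with torch.no_grad():
    patch_tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (1, N, 768)
refined = adapter(patch_tokens)  # use these instead of the raw DINOv2 features
```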
wimmerthomas.bsky.social
The CVML group at the @mpi-inf.mpg.de has been busy for CVPR. Check out our papers and come by the presentations!
cvml.mpi-inf.mpg.de
🎉 Exciting News #CVPR2025!

We’re proud to announce that we have 5 papers accepted to the main conference and 7 papers accepted at various CVPR workshops this year!

We’re looking forward to sharing our research with the community in Nashville!

Stay tuned for more details! @mpi-inf.mpg.de
Reposted by Thomas Wimmer
cvml.mpi-inf.mpg.de
Hello world, we are now on Bluesky 🦋! Follow us to receive updates on exciting research and projects from our group!

#computervision #machinelearning #research
wimmerthomas.bsky.social
We can animate arbitrary 3D scenes within 10 minutes on an RTX 4090 while keeping scene appearance and geometry intact.

Note that since I worked on this, open-source video diffusion models have improved significantly, which will directly improve the results of this method as well.

🧵⬇️
wimmerthomas.bsky.social
While we can now transfer motion into 3D, we still have to deal with a fundamental problem: the lack of 3D consistency in generated videos.
With limited resources, we can't fine-tune or retrain a VDM to be pose-conditioned. Thus, we propose a zero-shot technique to generate more 3D-consistent videos!
🧵⬇️
Improving the multi-view consistency of generated videos through latent interpolation. In addition to the rendering g(f)_s of the dynamic scene f from the current viewpoint (obtained with the rendering function g), we compute the latent embedding of the warped video output v_{s-1} from the previous optimization step (rendered from a different viewpoint). We linearly interpolate the latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The resulting output is finally decoded into a new video output v_s.
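A minimal sketch of this latent-interpolation step. The encode/denoise/decode functions are stand-ins for the VDM's VAE encoder, denoiser, and decoder, and the interpolation weight is an assumed value, not the paper's exact setting.

```python
# Hedged sketch of the zero-shot latent-interpolation step described above.
import torch
import torch.nn.functional as F

def encode(video: torch.Tensor) -> torch.Tensor:
    # Placeholder VAE encoder: (T, 3, H, W) -> (T, C, h, w) latents.
    return F.avg_pool2d(video, 8)

def denoise(latents, condition):
    # Placeholder for the VDM, conditioned on the static scene view.
    return latents

def decode(latents: torch.Tensor) -> torch.Tensor:
    # Placeholder VAE decoder.
    return F.interpolate(latents, scale_factor=8)

alpha = 0.5                                # interpolation weight (assumed)
render_s = torch.rand(16, 3, 256, 256)     # g(f)_s: rendering from the current viewpoint
warped_prev = torch.rand(16, 3, 256, 256)  # v_{s-1} warped into the current viewpoint
static_view = torch.rand(1, 3, 256, 256)   # static scene view used as conditioning

# Linearly interpolate the two latent embeddings before denoising.
z = alpha * encode(render_s) + (1.0 - alpha) * encode(warped_prev)
v_s = decode(denoise(z, condition=encode(static_view)))  # new video output v_s
```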
wimmerthomas.bsky.social
Standard practices like SDS fail for this task as VDMs provide a guidance signal that is too noisy, resulting in "exploding" scenes.

Instead, we propose to employ several pre-trained 2D models to directly lift motion from tracked points in the generated videos to 3D Gaussians.

🧵⬇️
Method overview for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D.
The 4D (dynamic 3D) Gaussians are initialized with the static 3D scene input.
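A minimal sketch of the lifting step: 2D point tracks are back-projected into camera space using the aligned depth values. The intrinsics, tensor shapes, and random inputs are illustrative assumptions; the actual pipeline uses off-the-shelf point trackers and depth estimators.

```python
# Hedged sketch of lifting 2D point tracks into 3D with aligned depth values.
import torch

def unproject(tracks_2d: torch.Tensor, depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """tracks_2d: (N, T, 2) pixel coords, depth: (N, T) depth at track locations,
    K: (3, 3) camera intrinsics. Returns (N, T, 3) camera-space 3D tracks."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (tracks_2d[..., 0] - cx) / fx * depth
    y = (tracks_2d[..., 1] - cy) / fy * depth
    return torch.stack([x, y, depth], dim=-1)

K = torch.tensor([[500.0, 0.0, 128.0],
                  [0.0, 500.0, 128.0],
                  [0.0, 0.0, 1.0]])
tracks_2d = torch.rand(100, 16, 2) * 256    # e.g. from an off-the-shelf point tracker
depth = torch.rand(100, 16) * 5 + 1         # aligned depth sampled at the tracks
tracks_3d = unproject(tracks_2d, depth, K)  # (100, 16, 3), used to drive the 3D Gaussians
```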
wimmerthomas.bsky.social
Had the honor to present "Gaussians-to-Life" at #3DV2025 yesterday. In this work, we used video diffusion models to animate arbitrary 3D Gaussian Splatting scenes.
This work was a great collaboration with @moechsle.bsky.social, @miniemeyer.bsky.social, and Federico Tombari.

🧵⬇️
wimmerthomas.bsky.social
Can you do reasoning with diffusion models?

The answer is yes!

Take a look at Spatial Reasoning Models. Hats off to the authors for this amazing work!
janericlenssen.bsky.social
Can image generators solve visual Sudoku?

Naively, no. But with sequentialization and the correct order, they can!

Check out @chriswewer.bsky.social's and Bart's SRMs for details.

Project: geometric-rl.mpi-inf.mpg.de/srm/
Paper: arxiv.org/abs/2502.21075
Code: github.com/Chrixtar/SRM
wimmerthomas.bsky.social
I wonder to what degree one could artificially make real images (with GT depth) more abstract during training, so that depth models learn the priors we have (like green = field, blue = sky), and whether that would actually give us any benefit, like increased robustness...
wimmerthomas.bsky.social
Ah, thanks, I overlooked that :)
wimmerthomas.bsky.social
Nice experiments! What model did you use?
Reposted by Thomas Wimmer
visinf.bsky.social
🏔️⛷️ Looking back on a fantastic week full of talks, research discussions, and skiing in the Austrian mountains!
wimmerthomas.bsky.social
Give a warm welcome to @janericlenssen.bsky.social!
janericlenssen.bsky.social
Hello bluesky-world :)

Introducing MEt3R: Measuring Multi-View Consistency in Generated Images.

A lack of 3D consistency in generated images is a limitation of many current multi-view/video/world generative models. To quantitatively measure these inconsistencies, check out Mohammad Asim's new work!
wimmerthomas.bsky.social
Well well, it turns out that GIFs aren't yet supported on this platform. Here is the teaser video as an MP4 instead:
wimmerthomas.bsky.social
This work was led by @mohammadasim98.bsky.social and is a collaboration with Christopher Wewer, Bernt Schiele and Jan Eric Lenssen.

Check out the website with lots of nice visuals that show how our metric works and use it in your next diffusion model project!

geometric-rl.mpi-inf.mpg.de/met3r/
MEt3R
Measuring Multi-View Consistency in Generated Images.
geometric-rl.mpi-inf.mpg.de
wimmerthomas.bsky.social
Important note: Our metric is not meant to measure the visual quality / appearance of generated content. Instead, it is meant to be orthogonal to existing image quality metrics, focusing on the 3D consistency of generated frames.
wimmerthomas.bsky.social
Especially for video generation methods where no ground-truth camera poses are given, our proposed metric can help shed light on the quality of the generated videos, rather than just reporting results from yet another human survey.
wimmerthomas.bsky.social
Speaking of multi-view diffusion models, we also trained a new open-source multi-view latent diffusion model built on top of Stable Diffusion and heavily inspired by the closed-source CAT3D model.

Weights and code are already public. Check it out!

github.com/mohammadasim...
wimmerthomas.bsky.social
The MEt3R scores correlate well with the 3D awareness of different multi-view image generation methods, as we show in our experiments. The metric is also differentiable, which means you could even use it for training! The code is easy to run and already open-sourced!

github.com/mohammadasim...
wimmerthomas.bsky.social
We propose MEt3R, a new metric for measuring multi-view consistency in generated images. Our method is built upon DUSt3R and evaluates the consistency of projected DINO features between two views. It is able to accurately capture the 3D consistency of generated images.
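A rough conceptual sketch of what the metric measures, not the actual implementation: DINO features from one generated view are projected into the other view using DUSt3R geometry and compared in feature space. All function names, shapes, and the final score scaling below are placeholders; the real, differentiable implementation is in the linked repository.

```python
# Hedged conceptual sketch of a projected-feature consistency measure.
import torch
import torch.nn.functional as F

def dino_features(img: torch.Tensor) -> torch.Tensor:
    # Placeholder for a DINO backbone: (1, 3, H, W) -> (1, C, h, w) patch features.
    return F.avg_pool2d(img, kernel_size=14)

def project_features(feats: torch.Tensor, geometry) -> torch.Tensor:
    # Placeholder: warp features from view 1 into view 2 using DUSt3R point maps.
    return feats

view_1 = torch.rand(1, 3, 224, 224)  # two generated images of the same scene
view_2 = torch.rand(1, 3, 224, 224)

f1, f2 = dino_features(view_1), dino_features(view_2)
f1_in_2 = project_features(f1, geometry=None)  # geometry would come from DUSt3R

# Lower feature similarity after projection -> lower multi-view consistency.
similarity = F.cosine_similarity(f1_in_2, f2, dim=1).mean()
score = 1.0 - similarity  # an (assumed) inconsistency score, lower is better
```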
wimmerthomas.bsky.social
Quantitative evaluation of diffusion model outputs is hard!

We realized that we often lack metrics for comparing the quality of video and multi-view diffusion models. Quantifying multi-view 3D consistency across frames is especially difficult.

But not anymore: Introducing MEt3R 🧵