Thomas Wimmer
@wimmerthomas.bsky.social
400 followers 140 following 24 posts
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision. https://wimmerth.github.io
wimmerthomas.bsky.social
Happy to find my name on the list of outstanding reviewers :]

Come and check out our poster on learning better features for semantic correspondence in Hawaii!

📍 Poster #538 (Session 2)
🗓️ Oct 21 | 3:15 – 5:00 p.m. HST

genintel.github.io/DIY-SC
iccv.bsky.social
There’s no conference without the efforts of our reviewers. Special shoutout to our #ICCV2025 outstanding reviewers 🫡

iccv.thecvf.com/Conferences/...
2025 ICCV Program Committee
iccv.thecvf.com
wimmerthomas.bsky.social
What was the patch size used here?
wimmerthomas.bsky.social
All the links can be found here. Great collaborators!

bsky.app/profile/odue...
oduenkel.bsky.social
🔗Project page: genintel.github.io/DIY-SC
📄Paper: arxiv.org/pdf/2506.05312
💻Code: github.com/odunkel/DIY-SC
🤗Demo: huggingface.co/spaces/odunk...

Great collaboration with @wimmerthomas.bsky.social, Christian Theobalt, Christian Rupprecht, and @adamkortylewski.bsky.social! [6/6]
wimmerthomas.bsky.social
🚀 Just accepted to ICCV 2025!

In DIY-SC, we improve foundation model features using a lightweight adapter trained with carefully filtered and refined pseudo-labels.

🔧 Drop-in alternative to plain DINOv2 features!
📦 Code + pre-trained weights available now.
🔥 Try it in your next vision project!
oduenkel.bsky.social
Are you using DINOv2 for tasks that require semantic features? DIY-SC might be the alternative!
It refines DINOv2 or SD+DINOv2 features and achieves a new SOTA on the semantic correspondence dataset SPair-71k among methods that do not rely on annotated keypoints! [1/6]
genintel.github.io/DIY-SC
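To illustrate the "drop-in" idea, here is a minimal sketch of refining frozen DINOv2 patch features with a small residual MLP adapter. The adapter architecture, dimensions, and dummy input are illustrative assumptions, not the released DIY-SC code; see the repository above for the actual implementation and weights.

```python
# Hedged sketch: refining DINOv2 patch features with a lightweight adapter,
# in the spirit of DIY-SC. The adapter below is an illustrative placeholder.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Hypothetical lightweight MLP adapter on top of frozen DINOv2 features."""
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the adapter a drop-in replacement.
        return feats + self.mlp(feats)

# Frozen DINOv2 backbone (ViT-B/14) from torch hub.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
adapter = FeatureAdapter(dim=768)  # in practice, loaded from released weights

img = torch.randn(1, 3, 518, 518)  # dummy image, 518 = 37 * 14 patches
with torch.no_grad():
    patch_tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (1, N, 768)
refined = adapter(patch_tokens)  # use these instead of the raw DINOv2 features
```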
wimmerthomas.bsky.social
The CVML group at the @mpi-inf.mpg.de has been busy for CVPR. Check out our papers and come by the presentations!
cvml.mpi-inf.mpg.de
🎉 Exciting News #CVPR2025!

We’re proud to announce that we have 5 papers accepted to the main conference and 7 papers accepted at various CVPR workshops this year!

We’re looking forward to sharing our research with the community in Nashville!

Stay tuned for more details! @mpi-inf.mpg.de
Reposted by Thomas Wimmer
cvml.mpi-inf.mpg.de
Hello world, we are now on Bluesky 🦋! Follow us to receive updates on exciting research and projects from our group!

#computervision #machinelearning #research
wimmerthomas.bsky.social
We can animate arbitrary 3D scenes within 10 minutes on an RTX 4090 while keeping scene appearance and geometry intact.

Note that since I worked on this, open-source video diffusion models have improved significantly, which will directly improve the results of this method as well.

🧵⬇️
wimmerthomas.bsky.social
While we can now transfer motion into 3D, we still have to deal with a fundamental problem: the lack of 3D consistency in generated videos.
With limited resources, we can't fine-tune or retrain a VDM to be pose-conditioned. Thus, we propose a zero-shot technique to generate more 3D-consistent videos!
🧵⬇️
Improving the multi-view consistency of generated videos through latent interpolation. In addition to the rendering g(f)_s of the dynamic scene f from the current viewpoint (obtained with the rendering function g), we compute the latent embedding of the warped video output v_{s-1} from the previous optimization step (rendered from a different viewpoint). We linearly interpolate the latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The resulting output is finally decoded into a new video output v_s.
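A minimal sketch of this latent-interpolation step. The encode/denoise/decode functions are stand-ins for the VDM's VAE encoder, denoiser, and decoder, and the interpolation weight is an assumed value, not the paper's exact setting.

```python
# Hedged sketch of the zero-shot latent-interpolation step described above.
import torch
import torch.nn.functional as F

def encode(video: torch.Tensor) -> torch.Tensor:
    # Placeholder VAE encoder: (T, 3, H, W) -> (T, C, h, w) latents.
    return F.avg_pool2d(video, 8)

def denoise(latents, condition):
    # Placeholder for the VDM, conditioned on the static scene view.
    return latents

def decode(latents: torch.Tensor) -> torch.Tensor:
    # Placeholder VAE decoder.
    return F.interpolate(latents, scale_factor=8)

alpha = 0.5                                # interpolation weight (assumed)
render_s = torch.rand(16, 3, 256, 256)     # g(f)_s: rendering from the current viewpoint
warped_prev = torch.rand(16, 3, 256, 256)  # v_{s-1} warped into the current viewpoint
static_view = torch.rand(1, 3, 256, 256)   # static scene view used as conditioning

# Linearly interpolate the two latent embeddings before denoising.
z = alpha * encode(render_s) + (1.0 - alpha) * encode(warped_prev)
v_s = decode(denoise(z, condition=encode(static_view)))  # new video output v_s
```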
wimmerthomas.bsky.social
Standard practices like SDS fail for this task as VDMs provide a guidance signal that is too noisy, resulting in "exploding" scenes.

Instead, we propose to employ several pre-trained 2D models to directly lift motion from tracked points in the generated videos to 3D Gaussians.

🧵⬇️
Method overview for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D.
The 4D (dynamic 3D) Gaussians are initialized with the static 3D scene input.
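A minimal sketch of the lifting step: 2D point tracks are back-projected into camera space using the aligned depth values. The intrinsics, tensor shapes, and random inputs are illustrative assumptions; the actual pipeline uses off-the-shelf point trackers and depth estimators.

```python
# Hedged sketch of lifting 2D point tracks into 3D with aligned depth values.
import torch

def unproject(tracks_2d: torch.Tensor, depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """tracks_2d: (N, T, 2) pixel coords, depth: (N, T) depth at track locations,
    K: (3, 3) camera intrinsics. Returns (N, T, 3) camera-space 3D tracks."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (tracks_2d[..., 0] - cx) / fx * depth
    y = (tracks_2d[..., 1] - cy) / fy * depth
    return torch.stack([x, y, depth], dim=-1)

K = torch.tensor([[500.0, 0.0, 128.0],
                  [0.0, 500.0, 128.0],
                  [0.0, 0.0, 1.0]])
tracks_2d = torch.rand(100, 16, 2) * 256    # e.g. from an off-the-shelf point tracker
depth = torch.rand(100, 16) * 5 + 1         # aligned depth sampled at the tracks
tracks_3d = unproject(tracks_2d, depth, K)  # (100, 16, 3), used to drive the 3D Gaussians
```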
wimmerthomas.bsky.social
Had the honor to present "Gaussians-to-Life" at #3DV2025 yesterday. In this work, we used video diffusion models to animate arbitrary 3D Gaussian Splatting scenes.
This work was a great collaboration with @moechsle.bsky.social, @miniemeyer.bsky.social, and Federico Tombari.

🧵⬇️
wimmerthomas.bsky.social
Can you do reasoning with diffusion models?

The answer is yes!

Take a look at Spatial Reasoning Models. Hats off to the authors for this amazing work!
janericlenssen.bsky.social
Can image generators solve visual Sudoku?

Naively, no. But with sequentialization and the correct order, they can!

Check out @chriswewer.bsky.social's and Bart's SRMs for details.

Project: geometric-rl.mpi-inf.mpg.de/srm/
Paper: arxiv.org/abs/2502.21075
Code: github.com/Chrixtar/SRM
wimmerthomas.bsky.social
I wonder to what degree one could artificially make real images (with GT depth) more abstract during training, so that depth models learn the priors we have (like green = field, blue = sky), and whether that would actually give us any benefit, like increased robustness...
wimmerthomas.bsky.social
Ah, thanks, I overlooked that :)
wimmerthomas.bsky.social
Nice experiments! What model did you use?
Reposted by Thomas Wimmer
visinf.bsky.social
🏔️⛷️ Looking back on a fantastic week full of talks, research discussions, and skiing in the Austrian mountains!
wimmerthomas.bsky.social
Give a warm welcome to @janericlenssen.bsky.social!
janericlenssen.bsky.social
Hello bluesky-world :)

Introducing MEt3R: Measuring Multi-View Consistency in Generated Images.

A lack of 3D consistency in generated images is a limitation of many current multi-view/video/world generative models. To quantitatively measure these inconsistencies, check out Mohammad Asim's new work!
wimmerthomas.bsky.social
Well well, it turns out that GIFs aren't yet supported on this platform. Here is the teaser video as an MP4 instead:
wimmerthomas.bsky.social
This work was led by @mohammadasim98.bsky.social and is a collaboration with Christopher Wewer, Bernt Schiele and Jan Eric Lenssen.

Check out the website with lots of nice visuals that show how our metric works and use it in your next diffusion model project!

geometric-rl.mpi-inf.mpg.de/met3r/
MEt3R
Measuring Multi-View Consistency in Generated Images.
geometric-rl.mpi-inf.mpg.de
wimmerthomas.bsky.social
Important note: Our metric is not meant to measure the visual quality / appearance of generated content. Instead, it is meant to be orthogonal to existing image quality metrics, focusing on the 3D consistency of generated frames.
wimmerthomas.bsky.social
Especially for video generation methods where no ground-truth camera poses are given, our proposed metric can help shed light on the quality of the generated videos, rather than just reporting results from yet another human survey.
wimmerthomas.bsky.social
Speaking of multi-view diffusion models, we also trained a new open-source multi-view latent diffusion model built on top of Stable Diffusion and heavily inspired by the closed-source CAT3D model.

Weights and code are already public. Check it out!

github.com/mohammadasim...
wimmerthomas.bsky.social
The MEt3R scores correlate well with the 3D awareness of different multi-view image generation methods, as we show in our experiments. The metric is also differentiable, which means you could even use it for training! The code is easy to run and already open-sourced!

github.com/mohammadasim...
wimmerthomas.bsky.social
We propose MEt3R, a new metric for measuring multi-view consistency in generated images. Our method is built upon DUSt3R and evaluates the consistency of projected DINO features between two views. It is able to accurately capture the 3D consistency of generated images.
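A rough conceptual sketch of what the metric measures, not the actual implementation: DINO features from one generated view are projected into the other view using DUSt3R geometry and compared in feature space. All function names, shapes, and the final score scaling below are placeholders; the real, differentiable implementation is in the linked repository.

```python
# Hedged conceptual sketch of a projected-feature consistency measure.
import torch
import torch.nn.functional as F

def dino_features(img: torch.Tensor) -> torch.Tensor:
    # Placeholder for a DINO backbone: (1, 3, H, W) -> (1, C, h, w) patch features.
    return F.avg_pool2d(img, kernel_size=14)

def project_features(feats: torch.Tensor, geometry) -> torch.Tensor:
    # Placeholder: warp features from view 1 into view 2 using DUSt3R point maps.
    return feats

view_1 = torch.rand(1, 3, 224, 224)  # two generated images of the same scene
view_2 = torch.rand(1, 3, 224, 224)

f1, f2 = dino_features(view_1), dino_features(view_2)
f1_in_2 = project_features(f1, geometry=None)  # geometry would come from DUSt3R

# Lower feature similarity after projection -> lower multi-view consistency.
similarity = F.cosine_similarity(f1_in_2, f2, dim=1).mean()
score = 1.0 - similarity  # an (assumed) inconsistency score, lower is better
```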
wimmerthomas.bsky.social
Quantitative evaluation of diffusion model outputs is hard!

We realized that we often lack metrics for comparing the quality of video and multi-view diffusion models. Quantifying multi-view 3D consistency across frames is especially difficult.

But not anymore: Introducing MEt3R 🧵