Sai Kumar Dwivedi
@saidwivedi.in
PhD Candidate at @MPI-IS || 3D Vision & Digital Avatars || Ex: @Daimler, @Intel

Webpage: https://saidwivedi.in
InteractVLM (#CVPR2025) is a great collaboration between MPI-IS, UvA, and Inria.
Authors: @saidwivedi.in, @anticdimi.bsky.social, S. Tripathi, O. Taheri, C. Schmid, @michael-j-black.bsky.social and @dimtzionas.bsky.social.
Code & models available at: interactvlm.is.tue.mpg.de (10/10)
June 15, 2025 at 12:23 PM
InteractVLM is the first method that infers 3D contacts on both humans and objects from in-the-wild images, and exploits these for 3D reconstruction via an optimization pipeline. In contrast, existing methods like PHOSA rely on handcrafted or heuristic-based contacts. (9/10)
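In code terms, contact-driven fitting might look roughly like the toy sketch below (hypothetical names and data, not the released optimization pipeline): predicted contact regions on the human and the object are pulled together in 3D.

```python
# Toy sketch of contact-driven optimization (hypothetical, not the
# official InteractVLM pipeline): corresponding contact points on the
# human and the object are pulled together in 3D.
import torch

h_contact = torch.randn(5, 3)                    # predicted human contact points (fixed)
o_contact_local = torch.randn(5, 3)              # object contact points, object frame
obj_offset = torch.zeros(3, requires_grad=True)  # object translation to optimize

opt = torch.optim.Adam([obj_offset], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    o_contact = o_contact_local + obj_offset     # place object in the world frame
    # Contact term: matched contact points should touch in 3D.
    loss = ((h_contact - o_contact) ** 2).sum(dim=-1).mean()
    loss.backward()
    opt.step()
print(f"final contact loss: {loss.item():.4f}")
```

A real pipeline would of course optimize full pose and shape with additional image and physics terms; this isolates only the contact term.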
June 15, 2025 at 12:23 PM
With just 5% of DAMON’s 3D body-contact annotations, InteractVLM surpasses the fully supervised DECO baseline trained on 100% of them. This is promising for reducing reliance on costly 3D data by leveraging foundational models. (8/10)
June 15, 2025 at 12:23 PM
InteractVLM also strongly outperforms prior work on object affordance prediction on the PIAD dataset, where affordance is defined as per-point contact probabilities on the object. (7/10)
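Concretely, that definition is a scalar field over the object's points; a minimal illustration with made-up numbers (the threshold is arbitrary):

```python
# Made-up illustration of "affordance as contact probabilities":
# a score per object point; thresholding gives the contact region.
import numpy as np

points = np.random.rand(1024, 3)      # toy object point cloud
probs = np.random.rand(1024)          # predicted contact probability per point
contact_region = points[probs > 0.5]  # arbitrary threshold for visualization
print(f"{len(contact_region)} of {len(points)} points predicted as contact")
```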
June 15, 2025 at 12:23 PM
InteractVLM significantly outperforms prior work, both qualitatively and quantitatively, on in-the-wild 3D human contact prediction (both binary & semantic) on the DAMON dataset. (6/10)
June 15, 2025 at 12:23 PM
To bridge this 2D-to-3D gap, we propose "Render-Localize-Lift" (sketched in code below):
- Render: project the 3D human/object meshes into multiview 2D images.
- Localize: a Multiview Localization (MV-Loc) model, guided by VLM tokens, predicts a 2D contact mask per view.
- Lift: back-project the 2D contact masks onto the 3D surface.
(5/10)
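A schematic of the idea (hypothetical names, toy stand-ins for the renderer and MV-Loc, not the released code). The essential ingredient is keeping per-pixel vertex correspondences from the render step so 2D masks can be lifted back to the mesh:

```python
# Schematic "Render-Localize-Lift" with toy stand-ins.
import numpy as np

H = W = 64
N_VERTS = 100
rng = np.random.default_rng(0)

def render_views(n_views=4):
    # Toy renderer: each view yields a pixel->vertex map (-1 = background).
    # A real renderer would rasterize the mesh from n_views cameras.
    return [rng.integers(-1, N_VERTS, size=(H, W)) for _ in range(n_views)]

def localize_2d(pix2vert):
    # Stand-in for the VLM-token-guided MV-Loc model: a 2D contact mask.
    return rng.random((H, W)) > 0.9

def lift_to_3d(masks, pix2verts):
    # A vertex is "in contact" if any view's mask covers one of its pixels.
    contact = np.zeros(N_VERTS, dtype=bool)
    for mask, p2v in zip(masks, pix2verts):
        contact[np.unique(p2v[mask & (p2v >= 0)])] = True
    return contact

pix2verts = render_views()
masks = [localize_2d(p) for p in pix2verts]
print(lift_to_3d(masks, pix2verts).sum(), "vertices lifted to 3D contact")
```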
June 15, 2025 at 12:23 PM
How can we infer 3D contact with limited 3D data? InteractVLM exploits foundational models: a VLM & a localization model, fine-tuned to reason about contact. Given an image & prompt, the VLM outputs tokens that guide localization. But these models work in 2D, while contact lives in 3D. (4/10)
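My reading of the token-conditioned localization step, as a rough sketch (not the actual architecture): the VLM emits an embedding for a special token, and a mask decoder scores image features against it.

```python
# Rough sketch of token-conditioned 2D localization (my reading of the
# idea, not the actual InteractVLM architecture).
import torch
import torch.nn as nn

D = 256
img_feats = torch.randn(1, D, 32, 32)  # spatial features from a vision backbone
vlm_token = torch.randn(1, D)          # embedding of the VLM's special token

to_query = nn.Linear(D, D)
query = to_query(vlm_token)            # project the token into feature space
# Dot product between the query and every spatial feature -> mask logits.
logits = torch.einsum("bd,bdhw->bhw", query, img_feats)
mask = logits.sigmoid() > 0.5          # predicted 2D contact mask
print(mask.shape)                      # torch.Size([1, 32, 32])
```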
June 15, 2025 at 12:23 PM
Furthermore, simple binary contact (touching “any” object) misses the rich semantics of real multi-object interactions. We therefore introduce a novel task, "Semantic Human Contact" estimation: predicting the contact points on a human that correspond to a specified object. (3/10)
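The task interface, in toy form (hypothetical names, with a random stand-in predictor): the same image queried with different object labels should yield different per-vertex contact maps on the body.

```python
# Toy interface for "Semantic Human Contact" (hypothetical names,
# random stand-in predictor, not a real model).
from dataclasses import dataclass
import numpy as np

N_BODY_VERTS = 6890  # vertex count of the SMPL body mesh

@dataclass
class SemanticContactQuery:
    image_path: str
    object_name: str  # e.g. "cup", "bicycle"

def predict_semantic_contact(q: SemanticContactQuery) -> np.ndarray:
    # Returns a contact probability per body vertex, conditioned on the
    # named object (here just seeded randomness, for illustration).
    rng = np.random.default_rng(abs(hash(q.object_name)) % 2**32)
    return rng.random(N_BODY_VERTS)

cup = predict_semantic_contact(SemanticContactQuery("img.jpg", "cup"))
bike = predict_semantic_contact(SemanticContactQuery("img.jpg", "bicycle"))
print((cup > 0.5).sum(), (bike > 0.5).sum())  # distinct contact maps per object
```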
June 15, 2025 at 12:23 PM
Precisely inferring where humans contact objects from an image is hard due to occlusion & depth ambiguity. Current datasets of images with 3D contact are small, as they’re costly & tedious to create (mocap or manual labeling), limiting the performance of contact predictors. (2/10)
June 15, 2025 at 12:23 PM
Why does 3D human-object reconstruction fail in the wild or get limited to a few object classes? A key missing piece is accurate 3D contact. InteractVLM (#CVPR2025) uses foundational models to infer contact on humans & objects, improving reconstruction from a single image. (1/10)
June 15, 2025 at 12:23 PM
I've been using GitHub's Lists feature for over a year, and it's seriously underrated! ⭐

It lets you assign labels to all your starred repos, making it super easy to find projects later based on specific fields or topics. No more endless scrolling!

Link to my list: github.com/saidwivedi?t...
March 12, 2025 at 8:37 AM