Lightnews — Scholar-powered news

Jianyuan Wang

@jianyuanwang.bsky.social

https://jytime.github.io/

Jianyuan Wang @jianyuanwang.bsky.social · Mar 18

I am trying to. We will probably hear about this around the next submission deadline 😂

Jianyuan Wang @jianyuanwang.bsky.social · Mar 18

It seems so (from a quick glance only). The techniques used by Fast3R can also be applied to VGGT.

Jianyuan Wang @jianyuanwang.bsky.social · Mar 18

Haha, this probably serves as an indirect validation of NVIDIA’s stock value.

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Currently, this training approach is not very stable, but I believe that's likely because I haven't yet found the correct training method. I hope it can achieve better results in the future, which would avoid explicitly modelling point maps.

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Finally, great work together with Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny!

@oxford-vgg.bsky.social

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Interesting observation: VGGT’s camera & depth predictions are highly accurate and consistent. Unprojecting our predicted depth with predicted camera parameters yields even more precise point clouds than directly predicted point maps! Try this yourself using the Hugging Face demo 🤗
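
The unprojection mentioned in this post can be sketched roughly as follows. This is a minimal NumPy illustration, not VGGT's own code: the function name `unproject_depth` is hypothetical, and it assumes a pinhole intrinsics matrix `K`, depth measured along the camera z-axis, and a 4x4 world-to-camera extrinsic matrix. The conventions actually used by VGGT may differ.

```python
import numpy as np

def unproject_depth(depth, K, world_to_cam):
    """Lift a depth map to world-space 3D points (illustrative sketch).

    depth:        (H, W) depth along the camera z-axis (assumed convention).
    K:            (3, 3) pinhole intrinsics (assumed model).
    world_to_cam: (4, 4) extrinsics mapping world to camera coordinates (assumed).
    Returns an (H, W, 3) array of world-space points.
    """
    H, W = depth.shape
    # Pixel grid: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project pixels to camera-space rays with z = 1.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-space points.
    cam_points = rays * depth[..., None]
    # Move the points into the world frame using the inverse extrinsics.
    cam_to_world = np.linalg.inv(world_to_cam)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_points @ R.T + t
```

Applying this per frame and concatenating the results would give one point cloud for the whole image set, which could then be compared against the directly predicted point maps.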

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Compared to concurrent CVPR'25 Transformer-based 3D reconstruction methods, VGGT achieves significantly higher accuracy, with speed similar to the fastest variant, Fast3R.

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Bonus insight: Using pretrained VGGT significantly enhances downstream tasks like:
🚀 Non-rigid point tracking
🚀 Feed-forward novel view synthesis

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

A strong advantage of our method is the ability to predict 3D attributes without any expensive optimization. For example:
🔸 VGGT can easily process ~200 images in ~10s on a single 40GB A100 GPU
🔸 50x faster than optimization-based methods, using far less memory.

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Try our demo live on Hugging Face Spaces!

🤗: huggingface.co/spaces/faceb...

(See demo illustration below) 👇

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

No expensive optimization needed, yet delivers SOTA results for:

✅ Camera Pose Estimation
✅ Multi-view Depth Estimation
✅ Dense Point Cloud Reconstruction
✅ Point Tracking

Jianyuan Wang @jianyuanwang.bsky.social · Mar 17

Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds!

Project Page: vgg-t.github.io
Code & Weights: github.com/facebookrese...
