Jianyuan Wang
@jianyuanwang.bsky.social
https://jytime.github.io/
jianyuanwang.bsky.social
I am trying to. We will probably hear more about this around the next submission deadline 😂
jianyuanwang.bsky.social
It seems so (from a quick glance only). The techniques used by Fast3R can also be applied to VGGT.
jianyuanwang.bsky.social
Haha, this probably serves as an indirect validation of NVIDIA’s stock value.
jianyuanwang.bsky.social
Currently, this training approach is not very stable, but I believe that's likely because I haven't yet found the correct training method. I hope it can achieve better results in the future, which would remove the need to model point maps explicitly.
jianyuanwang.bsky.social
Finally, great work together with Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny!

@oxford-vgg.bsky.social
jianyuanwang.bsky.social
Interesting observation: VGGT’s camera & depth predictions are highly accurate and consistent. Unprojecting our predicted depth with predicted camera parameters yields even more precise point clouds than directly predicted point maps! Try this yourself using the Hugging Face demo 🤗
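For readers who want to see what "unprojecting depth with camera parameters" means in practice, here is a minimal NumPy sketch of standard pinhole unprojection. It is not VGGT's own code or API: the function name `unproject_depth`, the argument layout, and the camera-to-world pose convention are all assumptions made for illustration. If your poses are world-to-camera, invert them first.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map to world-space 3D points (hypothetical helper).

    depth:        (H, W) depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world pose (assumed convention)
    returns:      (H, W, 3) world-space points
    """
    H, W = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project pixels to camera-frame rays with z = 1.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-frame points.
    cam_points = rays * depth[..., None]
    # Move the points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_points @ R.T + t
```

Given a predicted depth map and predicted camera for each frame, calling this per frame and concatenating the results gives one point cloud in a shared world frame, which can then be compared against the directly predicted point map.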
jianyuanwang.bsky.social
Compared to concurrent CVPR'25 Transformer-based 3D reconstruction methods, VGGT achieves significantly higher accuracy, with speed similar to the fastest variant, Fast3R.
jianyuanwang.bsky.social
Bonus insight: Using pretrained VGGT significantly enhances downstream tasks like:
🚀 Non-rigid point tracking
🚀 Feed-forward novel view synthesis
jianyuanwang.bsky.social
A strong advantage of our method is the ability to predict 3D attributes without any expensive optimization. For example:
🔸 VGGT can easily process ~200 images in ~10s on a single 40GB A100 GPU
🔸 50x faster than optimization-based methods, using far less memory
jianyuanwang.bsky.social
Try our demo live on Hugging Face Spaces!

🤗: huggingface.co/spaces/faceb...

(See demo illustration below) 👇
jianyuanwang.bsky.social
No expensive optimization needed, yet VGGT delivers SOTA results for:

✅ Camera Pose Estimation
✅ Multi-view Depth Estimation
✅ Dense Point Cloud Reconstruction
✅ Point Tracking
jianyuanwang.bsky.social
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds!

Project Page: vgg-t.github.io
Code & Weights: github.com/facebookrese...