@gaoyuezhou.bsky.social
Huge thanks to all my collaborators who made this project possible: @hengkaipan.bsky.social, @yann-lecun.bsky.social, @lerrelpinto.com
We have open-sourced our code and data. For more details, check out the paper and website:
Website: dino-wm.github.io
arXiv: arxiv.org/abs/2411.04983
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
January 31, 2025 at 7:24 PM
Overall, DINO-WM takes a step toward bridging the gap between task-agnostic world modeling and reasoning for control, offering promising prospects for generic world models in real-world applications.
The object and spatial understanding priors of DINOv2 features enable robust scene understanding, essential for navigation and manipulation tasks. With these priors, DINO-WM outperforms state-of-the-art world models by 45% in downstream task performance on our hardest tasks.
DINO-WM consists of:

1️⃣ An out-of-the-box, frozen DINOv2 model as the observation encoder.
2️⃣ A causal ViT as the predictor.
3️⃣ An optional decoder, used only for visualization.

DINO-WM plans entirely in latent space, without the need to reconstruct pixel images.
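The three components above can be sketched in a few lines. This is a hedged, toy illustration (the function names, shapes, and linear "networks" are placeholders, not the authors' actual code): a frozen encoder maps an observation to a latent, and a predictor rolls dynamics forward purely in latent space, with no pixel reconstruction in the loop.

```python
import numpy as np

# Illustrative sketch of the DINO-WM pipeline (all names/shapes are
# placeholders). encode() stands in for a frozen DINOv2 encoder;
# predict() stands in for the causal ViT dynamics predictor.

rng = np.random.default_rng(0)
LATENT_DIM = 16  # stand-in for the DINOv2 feature dimension

W_enc = rng.normal(size=(8, LATENT_DIM)) * 0.1
W_dyn = rng.normal(size=(LATENT_DIM + 2, LATENT_DIM)) * 0.1

def encode(obs):
    """Frozen observation model: 'image' -> latent vector."""
    return np.tanh(obs @ W_enc)

def predict(z, action):
    """Latent predictor: (z_t, a_t) -> z_{t+1}."""
    return np.tanh(np.concatenate([z, action]) @ W_dyn)

obs = rng.normal(size=8)          # toy observation
z = encode(obs)
for t in range(3):                # roll out 3 steps entirely in latent space
    z = predict(z, np.zeros(2))   # no decoder / pixel reconstruction needed
print(z.shape)
```

The decoder is deliberately absent from the rollout: it would only be trained for visualizing latents, never used for planning.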
Unlike previous works that couple world model learning with behavior learning, we train a dynamics-only model and infer actions only at test time. This allows zero-shot goal-reaching by reasoning through the dynamics—no expert demonstrations, no rewards, no online interactions.
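Test-time planning of this kind can be sketched with a sampling-based optimizer over action sequences, minimizing distance to the goal in latent space. This is an assumption-laden toy (a cross-entropy-method style loop with a made-up linear dynamics stand-in), shown only to make "infer actions by reasoning through the dynamics" concrete:

```python
import numpy as np

# Hedged sketch: zero-shot goal-reaching by optimizing actions through a
# learned dynamics model at test time. dynamics() is a toy placeholder;
# the cost is latent distance to the goal -- no rewards, no demonstrations.

rng = np.random.default_rng(1)

def dynamics(z, a):
    """Toy stand-in for the learned latent predictor."""
    return z + a  # 2-D latent; actions shift the latent directly

def plan(z0, z_goal, horizon=5, samples=256, elites=32, iters=5):
    """Cross-entropy-method style search over action sequences."""
    mu = np.zeros((horizon, 2))
    sigma = np.ones((horizon, 2))
    for _ in range(iters):
        acts = rng.normal(mu, sigma, size=(samples, horizon, 2))
        costs = []
        for seq in acts:
            z = z0
            for a in seq:
                z = dynamics(z, a)
            costs.append(np.sum((z - z_goal) ** 2))  # latent-space goal cost
        elite = acts[np.argsort(costs)[:elites]]     # keep best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 0.05
    return mu  # refined action sequence

z0, z_goal = np.zeros(2), np.array([3.0, -2.0])
actions = plan(z0, z_goal)
z = z0
for a in actions:
    z = dynamics(z, a)
print(np.sum((z - z_goal) ** 2))  # small residual latent distance
```

Because the world model is trained on dynamics alone, the same model and planner can be pointed at any reachable goal latent, which is what makes the goal-reaching zero-shot.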