@gaoyuezhou.bsky.social
gaoyuezhou.bsky.social
Overall, DINO-WM takes a step toward bridging the gap between task-agnostic world modeling and downstream reasoning and control, offering promising prospects for generic world models in real-world applications.
gaoyuezhou.bsky.social
The object-level and spatial priors in DINOv2 features enable robust scene understanding, which is essential for navigation and manipulation tasks. With these priors, DINO-WM outperforms state-of-the-art world models by 45% in downstream task performance on our hardest tasks.
gaoyuezhou.bsky.social
DINO-WM consists of:

1️⃣ An out-of-the-box DINOv2 model as the observation model.
2️⃣ A causal ViT as the predictor.
3️⃣ An optional decoder, used only for visualization.

DINO-WM plans entirely in latent space, without the need to reconstruct pixel images.
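A minimal sketch of how these pieces might fit together (PyTorch). The module names, sizes, and the one-step rollout below are illustrative assumptions, not the authors' exact configuration; the frozen DINOv2 backbone supplies patch-token latents, and a causal transformer predicts the next latents from the current latents and action:

```python
import torch
import torch.nn as nn


class LatentDynamicsModel(nn.Module):
    """Sketch of a DINO-WM-style world model: frozen DINOv2 patch features
    as the observation latent, a causal transformer as the predictor.
    Hyperparameters and names here are hypothetical."""

    def __init__(self, action_dim: int, embed_dim: int = 384, depth: int = 6):
        super().__init__()
        # Off-the-shelf DINOv2 ViT-S/14 backbone, kept frozen.
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.encoder.parameters():
            p.requires_grad = False

        # Project actions into the same embedding space as the patch tokens.
        self.action_proj = nn.Linear(action_dim, embed_dim)

        # Causal transformer predictor over (patch latents + action token).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=depth)

    @torch.no_grad()
    def encode(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (B, 3, H, W) -> patch-token latents (B, N, D)."""
        feats = self.encoder.forward_features(frames)
        return feats["x_norm_patchtokens"]

    def predict_next(self, latents: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """One-step latent rollout: current patch latents + action -> predicted next latents."""
        tokens = torch.cat([latents, self.action_proj(action).unsqueeze(1)], dim=1)
        # Standard upper-triangular causal mask so each token attends only to the past.
        n = tokens.size(1)
        causal_mask = torch.triu(
            torch.full((n, n), float("-inf"), device=tokens.device), diagonal=1
        )
        out = self.predictor(tokens, mask=causal_mask)
        # Drop the action token; keep the predicted next-step patch latents.
        return out[:, : latents.size(1), :]
```

Because the decoder is optional, nothing in this loop ever maps latents back to pixels; reconstruction is only needed if you want to visualize imagined rollouts.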
gaoyuezhou.bsky.social
Unlike prior work that couples world-model learning with behavior learning, we train a dynamics-only model and infer actions only at test time. This enables zero-shot goal reaching by reasoning through the dynamics: no expert demonstrations, no rewards, no online interaction.
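Concretely, test-time goal reaching can be posed as planning over the learned latent dynamics. Below is a simple random-shooting planner toward a goal image's latent; the actual optimizer and cost used in the paper may differ, and `plan_to_goal` plus the `encode`/`predict_next` interface are assumptions carried over from the sketch above:

```python
import torch
import torch.nn.functional as F


def plan_to_goal(model, obs_frame, goal_frame, horizon=10, n_samples=256, action_dim=2):
    """Zero-shot goal reaching by searching over action sequences in latent space.

    `model` is assumed to expose `encode(frames)` and `predict_next(latents, action)`.
    This is only an illustrative shooting-style planner, not the paper's exact procedure."""
    with torch.no_grad():
        z0 = model.encode(obs_frame)       # (1, N, D) current latent
        z_goal = model.encode(goal_frame)  # (1, N, D) goal latent

        # Sample candidate action sequences.
        actions = torch.randn(n_samples, horizon, action_dim)
        costs = torch.zeros(n_samples)

        for i in range(n_samples):
            z = z0.clone()
            # Roll the latent dynamics forward under the candidate actions.
            for t in range(horizon):
                z = model.predict_next(z, actions[i, t].unsqueeze(0))
            # Cost: distance between the predicted final latent and the goal latent.
            costs[i] = F.mse_loss(z, z_goal)

        best = costs.argmin()
        return actions[best]  # best action sequence; execute the first action(s)
```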
gaoyuezhou.bsky.social
Can we extend the power of world models beyond just online model-based learning? Absolutely!

We believe the true potential of world models lies in enabling agents to reason at test time.
Introducing DINO-WM: World Models on Pre-trained Visual Features for Zero-shot Planning.