Thaddäus Wiedemer
@thwiedemer.bsky.social
46 followers 110 following 8 posts
Intern at Google DeepMind Toronto | PhD student in ML at Max Planck Institute Tübingen and University of Tübingen.
Pinned
thwiedemer.bsky.social
Are we experiencing a 'GPT moment' in vision?

In our new preprint, we show that generative video models can solve a wide range of tasks across the entire vision stack without being explicitly trained for it.

🌐 video-zero-shot.github.io

1/n
thwiedemer.bsky.social
I'm truly honored to have worked on this at Google DeepMind with my amazing collaborators!

With 2 months left in my internship, I'm excited about our next steps in this direction!
thwiedemer.bsky.social
And as with other 'zero-shot' works, it's clear that Veo has been exposed to samples of many of our tasks in the training data. The promise lies in its ability to be quickly adapted to general tasks with just a prompt, no fine-tuning required!
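A minimal sketch of what prompt-only adaptation could look like; the `VideoModel` client, the model name, and the prompt wording are hypothetical stand-ins, not the paper's actual setup:

```python
# Hypothetical client for a generative video model such as Veo 3. The real
# API differs; the point is only that switching tasks means switching
# prompts, never updating weights.
from dataclasses import dataclass

@dataclass
class VideoModel:
    name: str = "veo-3"

    def generate_video(self, image_path: str, prompt: str) -> str:
        # Stub: a real implementation would call a video-generation API here.
        return f"{self.name}-output-{abs(hash(prompt)) % 1000}.mp4"

# One frozen model; the task is specified entirely by the prompt.
TASK_PROMPTS = {
    "segmentation": "Paint the foreground object solid red; keep everything else unchanged.",
    "edge detection": "Redraw the scene as a white-on-black edge map.",
    "deblurring": "Gradually sharpen the image until the blur is gone.",
}

model = VideoModel()
for task, prompt in TASK_PROMPTS.items():
    video_path = model.generate_video("input.png", prompt)
    print(task, "->", video_path)
```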
thwiedemer.bsky.social
Of course, performance is not perfect yet and lags behind SotA. Video models are also expensive to train and run, so they won't replace all vision models just yet. But the rapid progress from Veo 2 to Veo 3 illustrates their potential to become vision foundation models.
thwiedemer.bsky.social
Intuitively, some tasks are easier to solve directly in the vision domain, and we observe this in maze-solving tasks as well. This makes me super excited about a future where generalist vision and language models are integrated to reason about the real world by 'imagining' possible outcomes.
thwiedemer.bsky.social
On the reasoning side, videos as 'chain-of-frames' parallel chain-of-thought in LLMs. Complex visual tasks that an image editing model like Nano Banana would have to solve in one go can be broken down into smaller steps.
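A sketch of the chain-of-frames idea under my own assumptions (the `generate_video_frames` helper is a stand-in, and the paper's evaluation protocol may differ): let the video model work through intermediate frames and read the answer off the final one, rather than demanding a single-step edit.

```python
# Sketch: treat generated frames as intermediate reasoning steps
# ("chain of frames") and take the last frame as the task output.
import numpy as np

def generate_video_frames(image: np.ndarray, prompt: str, n_frames: int = 16):
    # Stub: a real call would return frames produced by a video model.
    return [image.copy() for _ in range(n_frames)]

def solve_visual_task(image: np.ndarray, instruction: str) -> np.ndarray:
    frames = generate_video_frames(image, instruction)
    return frames[-1]  # the final frame is the model's "answer"

maze = np.zeros((64, 64, 3), dtype=np.uint8)
solution = solve_visual_task(maze, "Draw the shortest path from entrance to exit in green.")
```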
thwiedemer.bsky.social
Specifically, Veo 3 can perceive (segment, localize, detect edges, ...), model (physics, abstract relations, memory), manipulate (edit images, simulate robotics), and reason about the visual world.

Video models might well become vision foundation models.
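One way such perception outputs could be mapped back to standard formats, assuming the model was prompted to paint the target region in a known color (an illustrative choice, not necessarily the paper's protocol):

```python
import numpy as np

def frame_to_mask(frame: np.ndarray, color=(255, 0, 0), tol: int = 60) -> np.ndarray:
    """Threshold an RGB frame (H, W, 3) against a target color.

    Assumes the model was asked to paint the segmented region in `color`;
    returns a boolean (H, W) mask usable for e.g. IoU evaluation.
    """
    diff = np.abs(frame.astype(int) - np.array(color)).sum(axis=-1)
    return diff < tol

# Example: a synthetic frame with a red square as the "segmentation".
frame = np.zeros((8, 8, 3), dtype=np.uint8)
frame[2:6, 2:6] = (255, 0, 0)
mask = frame_to_mask(frame)
assert mask[3, 3] and not mask[0, 0]
```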
thwiedemer.bsky.social
Check out our newest paper!

As always, it was super fun working on this with @prasannamayil.bsky.social
prasannamayil.bsky.social
New preprint out! 🎉

How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8
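A minimal sketch of fitting such a loss-to-loss relationship, assuming for simplicity a pure power law (linear in log-log space) and using synthetic numbers rather than the paper's data:

```python
import numpy as np

# Synthetic (train loss, downstream loss) pairs from models of varying size.
train_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
downstream_loss = np.array([2.1, 1.9, 1.7, 1.55, 1.45])

# Assume L_down ~ a * L_train**b, i.e. a line in log-log space.
b, log_a = np.polyfit(np.log(train_loss), np.log(downstream_loss), deg=1)

def predict_downstream(l_train: float) -> float:
    return float(np.exp(log_a) * l_train ** b)

print(predict_downstream(2.0))  # extrapolate to a lower training loss
```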
Reposted by Thaddäus Wiedemer
ahochlehnert.bsky.social
CuratedThoughts: Data Curation for RL Datasets 🚀

Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be eliminated through data curation.

Here's why 👇🧵
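A sketch of what such a curation pass could look like; the specific checks below (deduplication, verifiable answers, degenerate prompts) are illustrative assumptions, not the actual CuratedThoughts criteria:

```python
# Illustrative curation pass over reasoning-RL examples.
def curate(examples: list[dict]) -> list[dict]:
    seen_questions = set()
    kept = []
    for ex in examples:
        q = " ".join(ex["question"].split()).lower()
        if q in seen_questions:      # drop exact duplicates
            continue
        if not ex.get("answer"):     # drop samples with no ground-truth
            continue                 # answer to compute a reward against
        if len(ex["question"]) < 10: # drop degenerate prompts
            continue
        seen_questions.add(q)
        kept.append(ex)
    return kept

data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 2 + 2?", "answer": "4"},   # duplicate
    {"question": "Prove the claim.", "answer": ""},  # unverifiable
]
print(len(curate(data)))  # -> 1
```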
Reposted by Thaddäus Wiedemer
wielandbrendel.bsky.social
🚀 We’re hiring! Join Bernhard Schölkopf & me at @ellisinsttue.bsky.social to push the frontier of #AI in education!

We’re building cutting-edge, open-source AI tutoring models for high-quality, adaptive learning for all pupils with support from the Hector Foundation.

👉 forms.gle/sxvXbJhZSccr...
[Image: Hiring announcement. ELLIS Institute Tübingen is looking for ML Researchers & Engineers for Open-Source AI Tutoring (m/f/d).]