Dhruv Batra
@dhruvbatra.bsky.social
400 followers 93 following 37 posts
Co-founder & Chief Scientist at Yutori. Prev: Senior Director leading FAIR Embodied AI at Meta, and Professor at Georgia Tech.
Solved: robustness to paraphrasing and false premises, OCR, world-knowledge based reasoning.

Open: spatial reasoning, data-efficiency, learning compatible representations.
As part of the award ceremony, the VQA team presented a recap of vision-and-language research over the last decade: solved problems, progress, and open challenges for multimodal LLMs.
Lots to be done. Thank you to all our collaborators and the research community for this recognition!
Fun fact: the T-shirt I'm wearing is an inside joke about the quality of 2015 models.

However, every few years we rediscover the lesson that on difficult tasks, VLMs silently regress to being nearly blind.

x.com/DhruvBatra_/...
VQA challenge series won the Mark Everingham prize at #ICCV2025 for stimulating a new strand of vision-and-language research.

It's extra special because ICCV25 marks the 10-year anniversary of the VQA paper.

When we started, the idea of answering any question about any image seemed outlandish.
I dunno man, Dagger is cool.
The problem with “AI slop” isn’t the AI — it’s the slop.

People act like AI is the issue, when it’s actually part of the fix.

If we're honest: most of what we make, most of the time, is slop by our own standards.

That’s the generator–discriminator gap in creative work that Ira Glass talks about.
Somebody is a fan of Abundance
It is so refreshing to see conferences innovate on the reviewing model and run actual experiments (!) as opposed to fighting change.
For #ICLR2025, we piloted an LLM that provided optional feedback to some reviewers. Results are promising: over 12K suggestions were incorporated by reviewers to improve review quality. See our blog post for details and more analysis: blog.iclr.cc/2025/04/15/l...
Leveraging LLM feedback to enhance review quality – ICLR Blog
Good. Autonomous interface locomotion is the fundamental robotics problem of our time. The more the merrier.
The answer to many "why X?" questions:

Because the laws of physics do not prohibit X and the forces of biology gave us curiosity.
The web is the ultimate boss-level for agents — dynamic, non-deterministic, and noisy; some mistakes are inevitable and so far, every agent fails eventually.

Yutori is building superhuman agents for this ultimate digital environment.

Join our waitlist for early access to our product!

yutori.com
Yutori
We’re building AI agents that can reliably do everyday digital tasks for you on the web, towards an AI chief-of-staff for everyone.
𝐈𝐦𝐚𝐠𝐢𝐧𝐞 𝐚 𝐰𝐨𝐫𝐥𝐝 𝐰𝐡𝐞𝐫𝐞 𝐧𝐨 𝐡𝐮𝐦𝐚𝐧 𝐡𝐚𝐬 𝐭𝐨 𝐝𝐢𝐫𝐞𝐜𝐭𝐥𝐲 𝐢𝐧𝐭𝐞𝐫𝐚𝐜𝐭 𝐰𝐢𝐭𝐡 𝐭𝐡𝐞 𝐰𝐞𝐛 𝐚𝐠𝐚𝐢𝐧.

Where teams of AI assistants coordinate to book flights, manage budgets, or file paperwork—proactively surfacing insights and correcting errors.

Only problem — no one knows how to build AI agents that actually work.
I started something new last year with a wonderful group of people. We showed a demo in Jan.

Today, we’re telling our story — show before you talk!

𝘞𝘦 𝘢𝘳𝘦 𝘳𝘦-𝘪𝘮𝘢𝘨𝘪𝘯𝘪𝘯𝘨 𝘩𝘰𝘸 𝘱𝘦𝘰𝘱𝘭𝘦 𝘪𝘯𝘵𝘦𝘳𝘢𝘤𝘵 𝘸𝘪𝘵𝘩 𝘵𝘩𝘦 𝘸𝘦𝘣, one of humanity’s greatest inventions and a mess overdue for an overhaul.

yutori.com
Ah, understood. No idea about the tracing of that meme.
Seems like the ultimate thing to rally around, no? To the extent there is any purpose, what's the alternative?
I'm already there for low-stakes queries.
Where's the skepticism coming from? Now that web search and citations are in there, isn't it easy to verify and thus become more confident?
Reposted by Dhruv Batra
📢Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025!
🌐 sites.google.com/view/vlms4all
Using a locally-running LLM to translate a review is explicitly prohibited by @iccv.bsky.social

Why? Whom does this possibly harm?
The way it's always been done isn't handling the current scale well (as evidenced by the feedback from the authors). Yes, outsource to a company, pay for creation of new tools, start new companies, all of the standard ways of addressing a growing market.
Why is it volunteer work? Why doesn't an organization that takes in millions in sponsorship professionalize?