Lightnews — Scholar-powered news

Yixiao Song @yixiaosong.bsky.social · Mar 12

We continuously update BEARCUBS with challenging questions. If you're developing web agents, use BEARCUBS to benchmark their real-world performance! 🚀

Work done with Katherine Thai, @chautmpham.bsky.social, @yapeichang.bsky.social, Mazin Nadaf & @miyyer.bsky.social

🌐 bear-cubs.github.io/

Yixiao Song @yixiaosong.bsky.social · Mar 12

Future computer-use agents should be enhanced with:

💡 Stronger multimodal reasoning (videos, maps, real-time data)
🔍 More reliable source selection
🗺️ Smarter and more efficient search strategies
📜 Transparent and interpretable browsing trajectories

Yixiao Song @yixiaosong.bsky.social · Mar 12

❌ No agent excels at video, images, or interactive web content.

Current agents struggle with:
🚨 Selecting reliable sources
🚨 Escaping dead loops
🚨 Engaging in multimodal interactions
🚨 Navigating the web in real-time

Yixiao Song @yixiaosong.bsky.social · Mar 12

🐻 BEARCUBS 🐻 questions aren't easy! Humans achieve 84.7% accuracy. How well do web agents perform? 🤔

Not great ...
🥴 The best multimodal web agent, OpenAI’s Operator, scores 24.3% accuracy.
🤯 OpenAI’s Deep Research outperforms all (35.1%), without computer-use abilities!

Yixiao Song @yixiaosong.bsky.social · Mar 12

Why a new web agent benchmark? Because the popular ones 👇
1️⃣ Use simulations (e.g., WebArena), missing real-world complexity
2️⃣ Have limited multimodal testing, relying on HTML (Mind2Web) or specific skills (e.g., maps)
3️⃣ Are nearing performance saturation: Operator hits 87% on WebVoyager

Yixiao Song @yixiaosong.bsky.social · Mar 12

BEARCUBS 👇
🔹 Benchmarks computer-using agents @OpenAI Operator, @AnthropicAI Computer Use, and @convergence_ai_ Proxy
🔹 Evaluates complex text-based & multimodal interactions
🔹 Will be updated regularly with new questions

📜 arxiv.org/abs/2503.07919
🌐 bear-cubs.github.io/

Yixiao Song @yixiaosong.bsky.social · Mar 12

Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
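
For a sense of scale, the rounded percentages above can be turned into rough question counts out of the 111 QA pairs. This is only a back-of-the-envelope sketch: the function name `approx_correct` is made up for illustration, and the counts are estimates derived from rounded figures, not numbers reported in the thread.

```python
# Rough conversion of the reported (rounded) accuracies into question counts.
# Assumption: every system was scored on all 111 QA pairs.

def approx_correct(accuracy_pct: float, n_questions: int = 111) -> int:
    """Estimate how many questions a given accuracy percentage corresponds to."""
    return round(accuracy_pct / 100 * n_questions)

# Percentages as posted in the thread.
reported = {
    "Humans": 85,
    "OpenAI Operator": 24,
    "Anthropic Computer Use": 14,
    "Convergence AI Proxy": 13,
}

for name, pct in reported.items():
    print(f"{name}: roughly {approx_correct(pct)} of 111 questions")
```

Because the inputs are rounded to whole percentages, each estimate could be off by about one question.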
