Xiangru (Edward) Jian
@edwardjian.bsky.social
7 followers 8 following 10 posts
CS PhD student at University of Waterloo. Visiting Researcher at ServiceNow Research. Working on AI and DB.
Reposted by Xiangru (Edward) Jian
ericzzj.bsky.social
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang, Kevin Qinghong Lin, @edwardjian.bsky.social, Xi He, @philiptorr.bsky.social

tl;dr: great stuff!

arxiv.org/abs/2505.21497
edwardjian.bsky.social
🚀 Excited to share that UI-Vision has been accepted at ICML 2025! 🎉

We have also released the UI-Vision grounding datasets. Test your agents on them now (quick loading sketch below)! 🚀

🤗 Dataset: huggingface.co/datasets/Ser...

#ICML2025 #AI #DatasetRelease #Agents
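For anyone who wants to try it, here is a minimal sketch of pulling the benchmark with the Hugging Face `datasets` library. The link in the post is truncated, so the repo id below (`ServiceNow/ui-vision`), the split name, and the field names are assumptions; check the dataset page for the real values.

```python
# Minimal loading sketch -- repo id, split, and fields are assumptions,
# since the link above is truncated; consult the Hugging Face page.
from datasets import load_dataset

ds = load_dataset("ServiceNow/ui-vision", split="test")  # hypothetical repo id
print(ds[0].keys())  # inspect what each example carries (screenshot, query, bbox, ...)
```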
edwardjian.bsky.social
We want UI-Vision to be the go-to benchmark for desktop GUI agents.
📢 Data, benchmarks & code coming soon!
💡 Next: scaling training data & models for long-horizon tasks.
Let’s build, benchmark & push GUI agents forward 🚀
edwardjian.bsky.social
Interacting with desktop GUIs remains a challenge.
🖱️ Models struggle with click & drag actions due to poor grounding and limited motion understanding.
🏆 UI-TARS leads across models!
🧠 Closed models (GPT-4o, Claude, Gemini) excel at planning but fail to localize.
edwardjian.bsky.social
Detecting functional UI regions is tough!
🤖 Even top GUI agents miss functional regions.
🏆 Closed-source VLMs shine with stronger visual understanding.
📉 Cluttered UIs bring down IoU (see the IoU sketch below).
🚀 We’re the first to propose this task.
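For context on the metric mentioned above, this is a minimal sketch of the standard Intersection-over-Union computation between a predicted and a ground-truth region; the paper's exact evaluation code may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143 for these toy boxes
```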
edwardjian.bsky.social
Grounding UI elements is challenging!
🤖 Even top VLMs struggle with fine-grained GUI grounding.
📊 GUI agents like UI-TARS (25.5%) & UGround (23.2%) do better but still fall short (scoring sketch below).
⚠️ Small elements, dense UIs, and limited domain/spatial understanding are major hurdles.
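Grounding numbers like the ones quoted above are typically computed as the fraction of predictions whose click point lands inside the ground-truth element box. A minimal sketch of that standard scoring rule, not necessarily the paper's exact protocol:

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the box (x1, y1, x2, y2)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def grounding_accuracy(pred_points, gt_boxes):
    """Fraction of predicted click points that hit their target element."""
    hits = sum(point_in_box(p, b) for p, b in zip(pred_points, gt_boxes))
    return hits / len(gt_boxes)

# Toy check: one hit, one miss -> 0.5
print(grounding_accuracy([(25, 25), (90, 90)], [(10, 10, 50, 50), (10, 10, 50, 50)]))
```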
edwardjian.bsky.social
We propose three key benchmark tasks to evaluate GUI Agents
🔹 Element Grounding – Identify a UI element from a text description
🔹 Layout Grounding – Understand UI layout structure & group elements
🔹 Action Prediction – Predict the next action given a goal, past actions & screen state (matching sketch below)
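To make the third task concrete, here is a sketch of how a predicted action might be scored against a gold action; the `Action` schema and the matching rule are illustrative assumptions, not the benchmark's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    # Hypothetical schema -- the benchmark's real action space may differ.
    kind: str                                         # e.g. "click", "drag", "type"
    point: Tuple[float, float]                        # coordinate the action operates on
    end_point: Optional[Tuple[float, float]] = None   # drag destination, if any

def action_match(pred: Action, gold_kind: str, gold_box) -> bool:
    """Count a prediction as correct when the action type matches and its
    coordinate lands inside the ground-truth element box (x1, y1, x2, y2)."""
    x, y = pred.point
    in_box = gold_box[0] <= x <= gold_box[2] and gold_box[1] <= y <= gold_box[3]
    return pred.kind == gold_kind and in_box

# Toy check: a click at (40, 20) against a gold "click" on box (30, 10, 60, 40)
print(action_match(Action("click", (40, 20)), "click", (30, 10, 60, 40)))  # True
```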
edwardjian.bsky.social
UI-Vision consists of
✅ 83 open-source desktop apps across 6 domains
✅ 450 human demonstrations of computer-use workflows
✅ Dense, human-annotated bounding boxes for UI elements plus rich action trajectories
edwardjian.bsky.social
Most GUI benchmarks focus on web or mobile.
🖥️ But what about desktop software, where most real work happens?
UI-Vision fills this gap by providing a large-scale benchmark with diverse and dense annotations to systematically evaluate GUI agents.
edwardjian.bsky.social
🚀 Super excited to announce UI-Vision: the largest and most diverse benchmark for evaluating agents on real-world desktop GUIs in offline settings.

📄 Paper: arxiv.org/abs/2503.15661
🌐 Website: uivision.github.io

🧵 Key takeaways 👇