Juan Rodriguez
@joanrod.bsky.social
AI Researcher. Working on Multimodal AI at ServiceNow, Mila joanrod.github.io
Reposted by Juan Rodriguez
alex-lacoste.bsky.social
We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.

In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude 3.5 Sonnet.
Reposted by Juan Rodriguez
chrisjpal.bsky.social
LLMs have a lot of potential for science, but scientists can be particularly sensitive to factuality, nuances, and hallucinations. The new ScholarQABench benchmark in this paper looks pretty useful for the community to monitor progress on LLMs for science. arxiv.org/html/2411.14199
joanrod.bsky.social
Also, we are currently at NeurIPS in Vancouver! We will be presenting this work at the RBFM workshop on Saturday! Come say hi, and let’s spark some collaborations! 🚀
joanrod.bsky.social
This was a monumental collaboration, and a huge thank you to all the co-authors, ServiceNow Research, Mila, and all the institutions involved for their incredible support! 🙏
joanrod.bsky.social
We hope this effort aids the community in building more robust models for these tasks while emphasizing the importance of open and transparent data usage and release.
joanrod.bsky.social
We evaluated several VLMs, both open and closed source, on BigDocs-Bench to build a leaderboard.

📊 Models trained on BigDocs outperformed all other models on BigDocs-Bench tasks and delivered robust performance on established benchmarks.
✅ Human evaluations confirmed their strong performance!
joanrod.bsky.social
To validate the quality of the BigDocs datasets, we trained several VLMs on BigDocs-7.5M and evaluated their performance on document-specific and general VLM benchmarks.

The results? Training on BigDocs provides significant boosts compared to training on other datasets! 📈✨
joanrod.bsky.social
We introduce BigDocs-Bench, a set of benchmarks that focus on:

📄 Document Understanding
🌐 Web and GUI reasoning
👨‍💻 Code Generation

We also tackle complex outputs such as SVG, LaTeX code, Markdown, and HTML, including very long and structured formats. Here are some examples.
joanrod.bsky.social

By sharing this journey, we aim to bring more transparency to how datasets are built—especially as data remains the most opaque aspect of model performance in today’s fast-moving AI landscape. 🌟
joanrod.bsky.social
Building BigDocs was no small feat! We curated a large-scale dataset from diverse, license-friendly sources and documented the entire process.
joanrod.bsky.social
🎉 Excited to introduce BigDocs!
An open, transparent multimodal dataset designed for:
📄 Documents
🌐 Web content
🖥️ GUI understanding
👨‍💻 Code generation from images
We’re also launching BigDocs-Bench:
➡️ Document, Web, GUI Visual reasoning
➡️ Converting images into JSON, Markdown, LaTeX, SVG, and more!