Sara Hooker
@sarahooker.bsky.social
7.6K followers 160 following 50 posts
I lead Cohere For AI. Formerly research at Google Brain. ML Efficiency, LLMs, @trustworthy_ml.
Reposted by Sara Hooker
princetoncitp.bsky.social
⚠️ Leaderboard Illusion: "We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release & retract scores if desired… the ability of these providers to choose the best score leads to biased Arena scores"

Paper out now!🔻
sarahooker.bsky.social
We tried very hard to get this right, and have spent the last 5 months working carefully to ensure rigor.

If you made it this far, take a look at the full 68 pages: arxiv.org/abs/2504.20879

Any feedback or corrections are of course very welcome.
sarahooker.bsky.social
Very proud of this work, led by Shivalika Singh and @mziizm.bsky.social, with Yiyang Nan, Alex Wang, Daniel D'Souza, @sayash.bsky.social, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, @shaynelongpre.bsky.social, @nlpnoah.bsky.social, and @beyzaermis.bsky.social.
sarahooker.bsky.social
This was an uncomfortable paper to work on because it asks us to look in the mirror as a community.

As scientists, we must do better.

As a community, I hope we can demand better. We lay out the 5 changes that are needed.
sarahooker.bsky.social
Overall, our work suggests that engagement from a handful of providers, and preferential policies from the Arena toward that same small group, have created conditions that favor overfitting to Arena-specific dynamics rather than general model quality.
sarahooker.bsky.social
We show that access to Chatbot Arena data yields substantial benefits.

Using Arena-style data in training boosts win rates by 112%, but this improvement doesn't transfer to tasks like MMLU, indicating overfitting to Arena's quirks rather than general performance gains.
sarahooker.bsky.social
These data differences stem from some key policies that benefit a handful of providers:

1) proprietary models sampled at higher rates to appear in battles 📶
2) open-weights + open-source models removed from Arena more often 🚮
3) the number of private variants tested before public release 🔍 (a rough sketch of how these compound follows below)
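To make concrete how these three policies can compound, here is a rough back-of-the-envelope sketch. All rates, exposure windows, and variant counts below are purely hypothetical placeholders, not figures from the paper:

```python
# Back-of-the-envelope sketch (hypothetical numbers, not the paper's): how higher
# sampling rates, slower deprecation, and more private variants compound into a
# large gap in the volume of Arena battle data each provider receives.

def expected_battles(sampling_rate, months_on_arena, num_variants,
                     battles_per_month=100_000):
    """Rough expected battle count: per-battle sampling rate x exposure time x variants."""
    return sampling_rate * battles_per_month * months_on_arena * num_variants

# Hypothetical proprietary provider: sampled often, rarely deprecated, many private variants.
proprietary = expected_battles(sampling_rate=0.10, months_on_arena=6, num_variants=10)

# Hypothetical open-weights provider: sampled less, deprecated sooner, one public model.
open_weights = expected_battles(sampling_rate=0.03, months_on_arena=3, num_variants=1)

print(f"data-access ratio: {proprietary / open_weights:.0f}x")  # ~67x with these toy inputs
```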
sarahooker.bsky.social
We also observe large differences in Arena data access.

Chatbot Arena is an open community resource that provides free feedback, yet 61.3% of all data goes to proprietary model providers.
sarahooker.bsky.social
We even run real-world private testing using Aya Vision models to show the gains you can expect.

Even when testing identical checkpoints, we see gains. This is the most conservative case, where quality is identical.
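To make the selection effect concrete, here is a minimal toy simulation (my own sketch, not the paper's code or numbers). It assumes every privately tested "variant" is the same checkpoint with a true rating of 1200 and that each private test yields a noisy score estimate; disclosing only the best of N estimates inflates the reported score, and the inflation grows with N:

```python
# Minimal toy simulation (my own sketch, not the paper's methodology): reporting
# only the best of N noisy score estimates inflates the disclosed rating even
# when every "variant" is the exact same checkpoint.
import random
import statistics

TRUE_SCORE = 1200   # hypothetical true Arena rating of the checkpoint
NOISE_STD = 15      # hypothetical noise from estimating the rating on finite battles
TRIALS = 10_000

def expected_best_of(n):
    """Average disclosed score when a provider keeps only the best of n noisy estimates."""
    best_scores = [
        max(random.gauss(TRUE_SCORE, NOISE_STD) for _ in range(n))
        for _ in range(TRIALS)
    ]
    return statistics.mean(best_scores)

for n in (1, 3, 10, 27):
    print(f"variants tested: {n:>2} -> expected disclosed score: {expected_best_of(n):.1f}")
# Selecting the max over identical-quality variants is pure selection bias;
# the gap above the true score of 1200 grows with the number of variants.
```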
sarahooker.bsky.social
There is no reasonable scientific justification for this practice.

Being able to choose the best score to disclose enables systematic gaming of Arena scores.

This advantage increases with the number of variants tested, and when other providers don't know that they can also test privately.
sarahooker.bsky.social
There is an unspoken policy of hidden testing that benefits a small subset of providers.

Providers can choose which score to disclose and retract all the others.

At the extreme, we see up to 27 models tested in the lead-up to a release.
sarahooker.bsky.social
We spent 5 months analyzing 2.8M battles on the Arena, covering 238 models across 43 providers.

We show that preferential policies favoring a handful of providers lead to overfitting to Arena-specific metrics rather than genuine AI progress.
sarahooker.bsky.social
It is critical for scientific integrity that we can trust our measures of progress.

@lmarena.bsky.social has become the go-to evaluation for AI progress.

Our release today demonstrates how difficult it is to maintain fair evaluations on the Arena, despite the best of intentions.
Reposted by Sara Hooker
mziizm.bsky.social
1/ Science is only as strong as the benchmarks it relies on.

So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?

We took a deep dive into Chatbot Arena to find out. 🧵
Reposted by Sara Hooker
jwenger.bsky.social
This has been a topic close to my heart for a long time.

We have an awesome lineup of speakers who have made deep contributions to open-source in ML, e.g. @sarahooker.bsky.social, @chrisrackauckas.bsky.social, Matt Johnson, Tri Dao, @stellaathena.bsky.social, Evan Shelhamer.
fsschneider.bsky.social
Tired of your open-source ML work not getting the academic recognition it deserves? 🤔 Submit to the first-ever CodeML workshop at #ICML2025! It focuses on new libraries, improvements to established ones, best practices, retrospectives, and more.
codeml-workshop.github.io/codeml2025/
Reposted by Sara Hooker
israsalazar.bsky.social
Today we are releasing Kaleidoscope 🎉

A comprehensive multimodal & multilingual benchmark for VLMs! It contains real questions from exams in different languages.

🌍 20,911 questions and 18 languages
📚 14 subjects (STEM → Humanities)
📸 55% multimodal questions
sarahooker.bsky.social
It is rare I get to completely disconnect. Very grateful for this week in Patagonia.
Reposted by Sara Hooker
cohereforai.bsky.social
We're particularly proud to release Aya Vision 8B - it's compact 🐭 and efficient 🐎, outperforming models up to 11x its size 📈.

Releasing open weights helps to make breakthroughs in VLMs accessible to the research community.
Reposted by Sara Hooker
cohereforai.bsky.social
Just 2 days after launch, Aya Vision is trending on @hf.co 🔥🔥

We launched open-weights with the goal of making VLM breakthroughs accessible to the research community - so exciting to see such a positive response.

huggingface.co/CohereForAI/...
Reposted by Sara Hooker
stevechapman.bsky.social
Love this post by @sarahooker.bsky.social on that other platform: "The first step of any meaningful pursuit is to severely underestimate its difficulty."
Reposted by Sara Hooker
cohereforai.bsky.social
Introducing ✨ Aya Vision ✨ - an open-weights model to connect our world through language and vision

Aya Vision adds breakthrough multimodal capabilities to our state-of-the-art multilingual 8B and 32B models. 🌿
Reposted by Sara Hooker
cohereforai.bsky.social
An important topic in AI is the climate impacts of the energy-intensive computing hardware needed to train and deploy AI models ⚡

Our policy primer explores ways to move towards more sustainable AI. 🌱

📜 cohere.com/research/pap...
Reposted by Sara Hooker
cohereforai.bsky.social
Does more compute equate with greater risk?⚡️What is our track record predicting what risks emerge with scale? 📈

In this work led by Sara Hooker, we seek to understand the viability of compute thresholds ⚖️ as a way to mitigate risk. 🦺

arxiv.org/abs/2407.05694
Reposted by Sara Hooker
cohereforai.bsky.social
In this work, we ask "How does model merging stack up when optimizing language models for diverse multitask learning?" 📚🧩

📜 https://arxiv.org/abs/2410.10801