Lightnews — Scholar-powered news

David Heineman

@davidheineman.com

28 followers 180 following 6 posts

Pre-doc @ai2.bsky.social
davidheineman.com

David Heineman

@davidheineman.com

Evaluating language models is tricky: how do we know if our results are real or due to random chance?

We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵

August 19, 2025 at 4:46 PM

Reposted by David Heineman

Ai2

@ai2.bsky.social

RewardBench 2 is here! We took a long time learning from our first reward model evaluation tool to build one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.

The RewardBench 2 Leaderboard on HuggingFace.

June 2, 2025 at 4:31 PM
