Eliya Habba

@eliyahabba.bsky.social

PhD student at Hebrew University #HebrewU #NLP

Eliya Habba @eliyahabba.bsky.social · Mar 17

Let’s build a more robust foundation for LLM evaluation!

A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:

@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social

Eliya Habba @eliyahabba.bsky.social · Mar 17

3. Some instances are consistently easy or hard across ALL prompts, no matter how you prompt: models either always succeed or consistently fail.
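One way to look for such instances is to group outcomes by instance and check whether every prompt variant agrees. The sketch below is only illustrative: the record layout `(prompt_variant, instance_id, is_correct)`, the toy data, and the function name `split_by_consistency` are assumptions made for this example, not DOVE's actual schema or tooling.

```python
from collections import defaultdict

# Hypothetical toy records: (prompt_variant, instance_id, is_correct).
# This layout is an assumption for illustration, not DOVE's real format.
records = [
    ("variant_a", 0, True), ("variant_a", 1, True), ("variant_a", 2, False),
    ("variant_b", 0, True), ("variant_b", 1, False), ("variant_b", 2, False),
]

def split_by_consistency(records):
    """Split instance ids into always-correct, always-wrong, and prompt-sensitive."""
    outcomes = defaultdict(list)
    for _variant, instance, correct in records:
        outcomes[instance].append(bool(correct))
    # all(o): correct under every variant; not any(o): wrong under every variant.
    always_easy = sorted(i for i, o in outcomes.items() if all(o))
    always_hard = sorted(i for i, o in outcomes.items() if not any(o))
    sensitive = sorted(i for i, o in outcomes.items() if any(o) and not all(o))
    return always_easy, always_hard, sensitive

easy, hard, sensitive = split_by_consistency(records)
print(easy, hard, sensitive)
```

In the toy data, instance 0 is solved under both variants, instance 2 fails under both, and only instance 1 depends on the prompt.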

Eliya Habba @eliyahabba.bsky.social · Mar 17

2. Selecting prompt characteristics (e.g., phrasing, enumerators) based on past examples helps efficiently find optimal prompts.

Eliya Habba @eliyahabba.bsky.social · Mar 17

Key findings from 🕊️ DOVE:

1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes (e.g., OLMo’s accuracy on HellaSwag ranges from 1% to 99% simply by changing prompt elements like phrasing, enumerators, and answer order).
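To make the idea of an accuracy range concrete, here is a minimal sketch that computes accuracy per prompt variant and reports the lowest and highest values. The record layout, the toy data, and the function names `accuracy_by_prompt` and `accuracy_spread` are hypothetical and chosen for this example; they do not reflect how DOVE stores or scores predictions.

```python
from collections import defaultdict

# Hypothetical toy records: (prompt_variant, instance_id, is_correct).
# The numbers are made up and unrelated to the results reported in the post.
records = [
    ("variant_a", 0, True), ("variant_a", 1, True), ("variant_a", 2, False),
    ("variant_b", 0, True), ("variant_b", 1, False), ("variant_b", 2, False),
    ("variant_c", 0, True), ("variant_c", 1, True), ("variant_c", 2, True),
]

def accuracy_by_prompt(records):
    """Return a dict mapping each prompt variant to its accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for variant, _instance, correct in records:
        totals[variant] += 1
        hits[variant] += int(correct)
    return {variant: hits[variant] / totals[variant] for variant in totals}

def accuracy_spread(records):
    """Return (lowest, highest) accuracy observed across prompt variants."""
    acc = accuracy_by_prompt(records)
    return min(acc.values()), max(acc.values())

low, high = accuracy_spread(records)
print(f"accuracy ranges from {low:.0%} to {high:.0%} across prompt variants")
```

Reporting the spread rather than a single-prompt score is one simple way to show how much a benchmark number depends on the prompt.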

Eliya Habba @eliyahabba.bsky.social · Mar 17

Goal: democratize LLM evaluation research and build meaningful, generalizable methods.

Talk to us about data you'd like to contribute or request evaluations you want to see added to 🕊️ DOVE!

Eliya Habba @eliyahabba.bsky.social · Mar 17

Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!

Eliya Habba @eliyahabba.bsky.social · Feb 3

🌍 AI is changing the world. Is AI regulation on the right track? 🤔

While regulators rely on benchmarking 📊, we show why it cannot guarantee AI behavior:
arxiv.org/pdf/2501.15693

Excited about this multidisciplinary collaboration!
@gabistanovsky.bsky.social,
@rkeydar.bsky.social, Gadi Perl