Eliya Habba
@eliyahabba.bsky.social
PhD student at Hebrew University #HebrewU #NLP
eliyahabba.bsky.social
Let’s build a more robust foundation for LLM evaluation!

A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:

@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social
eliyahabba.bsky.social
3. Some instances are consistently easy or hard across ALL prompts: no matter how you prompt, models either always succeed or always fail.
eliyahabba.bsky.social
2. Selecting prompt characteristics (e.g., phrasing, enumerators) based on past examples helps efficiently find optimal prompts.
eliyahabba.bsky.social
Key findings from 🕊️ DOVE:

1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes (e.g., OLMo’s accuracy on HellaSwag ranges from 1% to 99%, simply by changing prompt elements like phrasing, enumerators, and answer order).
eliyahabba.bsky.social
Goal: democratize LLM evaluation research and build meaningful, generalizable methods.

Talk to us about data you'd like to contribute or request evaluations you want to see added to 🕊️ DOVE!
eliyahabba.bsky.social
Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs across different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!
eliyahabba.bsky.social
🌍 AI is changing the world. Is AI regulation on the right track? 🤔

While regulators rely on benchmarking 📊, we show why it cannot guarantee AI behavior:
arxiv.org/pdf/2501.15693

Excited about this multidisciplinary collaboration!
@gabistanovsky.bsky.social,
@rkeydar.bsky.social, Gadi Perl