Eliya Habba
eliyahabba.bsky.social
PhD student at Hebrew University #HebrewU #NLP
Let’s build a more robust foundation for LLM evaluation!

A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:

@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social
March 17, 2025 at 2:43 PM
3. Some instances are consistently easy or hard across ALL prompts: no matter how you prompt, models either always succeed or always fail on them.
March 17, 2025 at 2:39 PM
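One way to look for such instances in your own evaluation logs is to group per-prompt outcomes by instance. The sketch below is illustrative only: the `results` layout and the function name `split_by_consistency` are assumptions, not part of DOVE.

```python
def split_by_consistency(results):
    """Split instances by how they behave across prompt variants.

    results: dict mapping an instance id to a non-empty list of booleans,
    one per prompt variant (True = the model answered correctly).
    This layout is a hypothetical one chosen for the example.
    """
    always_right, always_wrong, prompt_sensitive = [], [], []
    for instance_id, outcomes in results.items():
        if all(outcomes):
            always_right.append(instance_id)      # correct under every prompt
        elif not any(outcomes):
            always_wrong.append(instance_id)      # wrong under every prompt
        else:
            prompt_sensitive.append(instance_id)  # outcome depends on the prompt
    return always_right, always_wrong, prompt_sensitive


# Toy data, made up for illustration.
toy_results = {
    "q1": [True, True, True],
    "q2": [False, False, False],
    "q3": [True, False, True],
}
easy, hard, sensitive = split_by_consistency(toy_results)
```

Only the instances in the third group can be flipped by changing the prompt.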
2. Selecting prompt characteristics (e.g., phrasing, enumerators) based on past examples helps efficiently find optimal prompts.
March 17, 2025 at 2:39 PM
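A rough sketch of what "selecting based on past examples" could look like: for each prompt dimension, keep the value with the highest mean accuracy in earlier runs. This is a simple stand-in, not the method used in DOVE; the record layout and the name `best_value_per_dimension` are assumptions.

```python
from collections import defaultdict


def best_value_per_dimension(records):
    """Pick, per prompt dimension, the value with the highest mean accuracy.

    records: list of (settings, correct) pairs, where settings is a dict such
    as {"enumerator": "letters", "phrasing": "p1"} and correct is a bool.
    This layout is hypothetical and chosen only for the example.
    """
    totals = defaultdict(lambda: [0, 0])  # (dimension, value) -> [n_correct, n_seen]
    for settings, correct in records:
        for dim, value in settings.items():
            totals[(dim, value)][0] += int(correct)
            totals[(dim, value)][1] += 1

    best = {}  # dimension -> (value, accuracy)
    for (dim, value), (n_correct, n_seen) in totals.items():
        accuracy = n_correct / n_seen
        if dim not in best or accuracy > best[dim][1]:
            best[dim] = (value, accuracy)
    return {dim: value for dim, (value, _) in best.items()}


# Toy history, made up for illustration.
past_records = [
    ({"enumerator": "letters", "phrasing": "p1"}, True),
    ({"enumerator": "numbers", "phrasing": "p1"}, False),
    ({"enumerator": "letters", "phrasing": "p2"}, True),
    ({"enumerator": "numbers", "phrasing": "p2"}, True),
]
chosen = best_value_per_dimension(past_records)
```

Treating each dimension independently ignores interactions between dimensions; it is a cheap first pass, not an exhaustive search.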
Key findings from 🕊️ DOVE:

1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes (e.g., OLMo’s accuracy on HellaSwag ranges from 1% to 99% simply by changing prompt elements like phrasing, enumerators, and answer order).
March 17, 2025 at 2:38 PM
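To make the prompt elements named above concrete, here is a minimal sketch that enumerates variants of one multiple-choice question along three dimensions: phrasing, enumerators, and answer order. The templates, enumerator sets, and the function name `prompt_variants` are hypothetical and are not taken from DOVE.

```python
from itertools import permutations, product

# Hypothetical prompt dimensions, chosen only for illustration.
PHRASINGS = [
    "Question: {question}\nChoices:\n{choices}\nAnswer:",
    "Pick the best ending.\n{question}\n{choices}\nYour answer:",
]
ENUMERATORS = {
    "letters": ["A", "B", "C", "D"],
    "numbers": ["1", "2", "3", "4"],
}


def prompt_variants(question, options):
    """Yield (settings, prompt) for every phrasing x enumerator x answer order."""
    orders = permutations(range(len(options)))
    for template, (enum_name, labels), order in product(
        PHRASINGS, ENUMERATORS.items(), orders
    ):
        # Label the options in their shuffled order, e.g. "A. ...", "B. ...".
        choices = "\n".join(
            f"{labels[position]}. {options[option_index]}"
            for position, option_index in enumerate(order)
        )
        settings = {"enumerator": enum_name, "order": order}
        yield settings, template.format(question=question, choices=choices)


# Toy question, made up for illustration.
variants = list(
    prompt_variants(
        "The chef cracks an egg and",
        ["whisks it.", "flies away.", "paints a wall."],
    )
)
```

With 2 phrasings, 2 enumerator styles, and 3! = 6 answer orders, this toy setup already yields 24 distinct prompts for a single question.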
Goal: democratize LLM evaluation research and build meaningful, generalizable methods.

Talk to us about data you'd like to contribute or request evaluations you want to see added to 🕊️ DOVE!
March 17, 2025 at 2:38 PM