A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:
@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social
1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes ➡ e.g., OLMo’s accuracy on HellaSwag ranges from 1% to 99% simply by changing prompt elements such as phrasing, enumerators, and answer order.
Talk to us about data you'd like to contribute, or request evaluations you'd like to see added to 🕊️ DOVE!