I design tools and processes to support principled evaluation of AI systems.
lukeguerdan.com
We run experiments on 11 rating tasks and find that measuring agreement with forced-choice ratings (e.g., Hit-Rate, shown on the right) yields substantial mis-rankings relative to downstream evaluation task performance.
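As a rough sketch (the function name and toy data are illustrative, not taken from the paper), hit rate here can be read as the fraction of items where a judge's forced-choice rating matches the human forced-choice rating:

```python
def hit_rate(judge_ratings, human_ratings):
    """Fraction of items where the judge's forced-choice rating
    matches the human forced-choice rating."""
    assert len(judge_ratings) == len(human_ratings)
    hits = sum(j == h for j, h in zip(judge_ratings, human_ratings))
    return hits / len(judge_ratings)

# Illustrative toy data: 1 = "toxic", 0 = "not toxic"
human = [1, 0, 1, 1, 0]
judge = [1, 0, 0, 1, 1]
print(hit_rate(judge, human))  # 0.6
```

A metric like this collapses genuinely ambiguous items (where raters reasonably disagree) into a single "correct" label, which is one way agreement-based rankings can diverge from downstream performance.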
Is this toxic? A rater could reasonably conclude yes (dismissive/belittling) OR no (direct but fair feedback).
💡 Simplicity
⚙️ Resource requirements
🎯 Predictive performance
🌎 Portability
But how does target variable construction unfold in practice, and how can we better support it going forward? #CSCW2025 🧵
Sign up for a 45-minute Zoom session to provide feedback on a new tool for building trustworthy evals.
Learn more at tinyurl.com/llm-as-a-judge - receive $35 for participating in a session!