Luke Guerdan
lukeguerdan.bsky.social
PhD student @ Carnegie Mellon University
I design tools and processes to support principled evaluation of AI systems.
lukeguerdan.com
Beyond this specific example, we find the effects to be substantial in an aggregate analysis over all eleven rating tasks.
December 9, 2025 at 8:35 PM
How does this impact results in practice?

We run experiments on 11 rating tasks and find that measuring agreement with respect to forced-choice ratings (e.g., Hit-Rate, shown on the right) yields substantial mis-rankings of judge systems compared to their downstream evaluation task performance.
December 9, 2025 at 8:35 PM
To characterize how rating indeterminacy impacts judge system validation, we introduce a simple probabilistic framework that models how raters (human or judge system) resolve rating indeterminacy when it arises.
December 9, 2025 at 8:35 PM
For instance, suppose a model responds to a user's question "How serious is this issue?" with "That's a rookie mistake. Only an amateur would do that."

Is this toxic? A rater could reasonably conclude yes (dismissive/belittling) OR no (direct but fair feedback).
December 9, 2025 at 8:35 PM
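To make the toxicity example concrete, here is a minimal sketch of how forced-choice agreement can penalize a judge for picking the "other" reasonable answer. Everything in it is an assumption for illustration: the toy data, the label names, and both helper functions are hypothetical, and "hit rate" is taken here to mean the fraction of items where the judge's label matches the human label, which may differ from the definition used in the paper.

```python
# Hypothetical illustration only: data, labels, and function names are
# assumptions, not the authors' framework or metric definitions.

def forced_choice_hit_rate(human, judge):
    """Fraction of items where the judge's single label equals the human's single label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def reasonable_set_hit_rate(reasonable_sets, judge):
    """Fraction of items where the judge's label is among the labels a rater finds reasonable."""
    return sum(j in s for s, j in zip(reasonable_sets, judge)) / len(judge)


# Item 0 stands in for the "rookie mistake" reply: both labels are defensible.
reasonable_sets = [{"toxic", "not_toxic"}, {"toxic"}, {"not_toxic"}]
human_forced = ["toxic", "toxic", "not_toxic"]      # rater forced to pick one label
judge_forced = ["not_toxic", "toxic", "not_toxic"]  # judge picks the other defensible label

fc = forced_choice_hit_rate(human_forced, judge_forced)
rs = reasonable_set_hit_rate(reasonable_sets, judge_forced)
print(f"forced-choice hit rate: {fc:.2f}")
print(f"reasonable-set hit rate: {rs:.2f}")
```

In this toy setup the judge is marked wrong on item 0 under forced choice even though its answer is one a rater could reasonably give, so the two scores diverge.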
While engaging in bricolage, data scientists balance the validity of their target variable with other criteria, such as:
💡 Simplicity
⚙️ Resource requirements
🎯 Predictive performance
🌎 Portability
October 14, 2025 at 2:54 PM
A subtle aspect of predictive modeling is target variable construction: the process of translating a latent, unobservable concept like "healthcare need" into a prediction target.

But how does target variable construction unfold in practice, and how can we better support it going forward? #CSCW2025 🧵
October 14, 2025 at 2:54 PM
Have you built a generative AI evaluation that uses an LLM-as-a-judge and a rubric to rate model outputs?

Sign up for a 45-minute Zoom session to provide feedback on a new tool for building trustworthy evals.

Learn more at tinyurl.com/llm-as-a-judge - receive $35 for participating in a session!
August 19, 2025 at 7:46 PM