I design tools and processes to support principled evaluation of AI systems.
lukeguerdan.com
This project was also part of an internship with the FATE group at Microsoft Research NYC. Apply now for the next cycle! ✨ apply.careers.microsoft.com/careers/job/...
Blog: blog.ml.cmu.edu/2025/12/09/v...
Paper: arxiv.org/pdf/2503.05965
Code: github.com/lguerdan/ind...
1) For binary tasks, adding a clear "Maybe" option resolves the intra-rater disagreement issue: it makes F full-rank, which circumvents the identification challenge.
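To make that concrete, here is a minimal numpy sketch. It assumes θ is the distribution over response sets {Yes}, {No}, {Yes, No} and O is the observed distribution over forced-choice options; the exact parameterization of F in the paper may differ, so treat the matrix below as illustrative.

```python
import numpy as np

# Illustrative F with a "Maybe" option.
# Rows: observed forced-choice options (Yes, No, Maybe);
# columns: response sets ({Yes}, {No}, {Yes, No}).
# Assumption: most raters whose response set is {Yes, No} pick "Maybe" when it
# is offered, but a few still pick Yes or No.
F_with_maybe = np.array([
    [1.0, 0.0, 0.2],
    [0.0, 1.0, 0.2],
    [0.0, 0.0, 0.6],
])
print(np.linalg.matrix_rank(F_with_maybe))  # 3 -> full rank

# If the true response-set distribution is theta, the observed rating
# distribution O = F @ theta can be inverted to recover theta exactly.
theta_true = np.array([0.4, 0.3, 0.3])
O = F_with_maybe @ theta_true            # [0.46, 0.36, 0.18]
print(np.linalg.solve(F_with_maybe, O))  # [0.4, 0.3, 0.3] -> identified
```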
We run experiments on 11 rating tasks and find that measuring judge agreement against forced-choice ratings (e.g., via Hit-Rate) yields substantial mis-rankings of judge systems relative to their downstream evaluation task performance.
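For reference, here is roughly what a forced-choice agreement metric like hit rate looks like, assuming it is the fraction of items where the judge's rating matches the human's forced-choice rating (a sketch of the general idea, not the paper's exact implementation):

```python
from typing import Sequence

def hit_rate(judge_ratings: Sequence[str], human_ratings: Sequence[str]) -> float:
    """Fraction of items where the judge's forced-choice rating matches the human's."""
    assert len(judge_ratings) == len(human_ratings)
    matches = sum(j == h for j, h in zip(judge_ratings, human_ratings))
    return matches / len(judge_ratings)

# Toy usage
print(hit_rate(["Yes", "Yes", "No", "Yes"], ["Yes", "No", "No", "Yes"]))  # 0.75
```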
As a result, a judge can have high human–judge agreement w.r.t. forced-choice ratings, while having low agreement w.r.t. multi-label "response set" ratings.
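One way to see how the two notions can come apart (a toy illustration: it assumes response-set agreement means an exact match between the judge's and the human's rating sets, which is only one of several possible multi-label scoring choices):

```python
# Toy items: each human provides a response set; the forced-choice rating is
# whichever option they would pick if they had to choose exactly one.
human_response_sets = [{"Yes", "No"}, {"Yes", "No"}, {"Yes"}, {"Yes", "No"}]
human_forced_choice = ["Yes", "Yes", "Yes", "Yes"]  # ties broken toward "Yes"

# A judge that always outputs a bare "Yes".
judge_response_sets = [{"Yes"}] * 4
judge_forced_choice = ["Yes"] * 4

n = len(human_response_sets)
forced_choice_agreement = sum(
    j == h for j, h in zip(judge_forced_choice, human_forced_choice)
) / n
response_set_agreement = sum(
    j == h for j, h in zip(judge_response_sets, human_response_sets)
) / n

print(forced_choice_agreement)  # 1.0  -> looks like a great judge
print(response_set_agreement)   # 0.25 -> most items were actually indeterminate
```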
When we look at the factorization O = Fθ, we immediately spot an issue: the system is underdetermined!
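Concretely (under the same illustrative parameterization as above: θ over response sets {Yes}, {No}, {Yes, No}, O over the two forced-choice options), without a "Maybe" option F has more columns than rows, so distinct θ's can produce the same observed O:

```python
import numpy as np

# Without "Maybe": rows are forced-choice options (Yes, No);
# columns are response sets ({Yes}, {No}, {Yes, No}).
# Assumption: raters with the indeterminate set {Yes, No} split 50/50 when forced to choose.
F = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
])

theta_a = np.array([0.5, 0.3, 0.2])  # 20% of items are indeterminate
theta_b = np.array([0.6, 0.4, 0.0])  # no indeterminacy at all

print(F @ theta_a)  # [0.6, 0.4]
print(F @ theta_b)  # [0.6, 0.4]  -> same observed O, so theta is not identified
print(np.linalg.matrix_rank(F))  # 2, but 3 unknowns
```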
Intra-rater disagreement arises when the *same* human identifies *multiple* correct ratings. We call this form of intra-rater disagreement rating indeterminacy.
Is this toxic? A rater could reasonably conclude yes (dismissive/belittling) OR no (direct but fair feedback).
This work was in collaboration with the amazing team @devsaxena.bsky.social (co-first author), @schancellor.bsky.social, @zstevenwu.bsky.social, and @kenholstein.bsky.social
Thank you for making my first adventure into qualitative research a delightful experience :)
- Protocols to help data scientists identify minimum standards for validity and other criteria, tailored to their specific application context
- Tools designed to help data scientists identify and apply strategies more effectively
For example, they use "swapping" to change target variables when the first one poses unanticipated challenges, or "composing" to combine complementary dimensions of a concept into a single target variable.