Hanna Wallach

Hanna Megan Wallach is a computational social scientist and partner research manager at Microsoft Research.

H-index: 45
Fields & subjects: Computer science 69%, Physics 8%
hannawallach.bsky.social
This is happening now!!!
hannawallach.bsky.social
If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am--1:30pm PDT on Weds!!! icml.cc/virtual/2025...
hannawallach.bsky.social
Oh whoops! You are indeed correct -- it starts at 11am PT!
hannawallach.bsky.social
I also want to note that this paper has been in progress for many, many years, so we're super excited it's finally being published. It's also one of the most genuinely interdisciplinary projects I've ever worked on, which has made it particularly challenging and rewarding!!! ❤️
hannawallach.bsky.social
Check out the camera-ready version of our ACL Findings paper ("Taxonomizing Representational Harms using Speech Act Theory") to learn more!!! arxiv.org/pdf/2504.00928
hannawallach.bsky.social
Why does this matter? You can't mitigate what you can't measure, and our framework and taxonomy help researchers and practitioners design better ways to measure and mitigate representational harms caused by generative language systems.
hannawallach.bsky.social
Using this theoretical grounding, we provide new definitions for stereotyping, demeaning, and erasure, and break them down into a detailed taxonomy of system behaviors. By doing this, we unify many of the different ways representational harms have been previously defined.
hannawallach.bsky.social
We bring some much-needed clarity by turning to speech act theory—a theory of meaning from linguistics that allows us to distinguish between a system output’s purpose and its real-world impacts.
hannawallach.bsky.social
These are often called “representational harms,” and while they’re easy for people to recognize when they see them, definitions of these harms are commonly under-specified, leading to conceptual confusion. This makes them hard to measure and even harder to mitigate.
hannawallach.bsky.social
Generative language systems are everywhere, and many of them stereotype, demean, or erase particular social groups.
hannawallach.bsky.social
Real talk: GenAI systems aren't toys. Bad evaluations don't just waste people's time---they can cause real-world harms. It's time to level up, ditch the apples-to-oranges comparisons, and start doing measurement like we mean it.

(5/6)
hannawallach.bsky.social
We propose a framework that cuts through the chaos: first, get crystal clear on what you're measuring and why (no more vague hand-waving); then, figure out how to measure it; and, throughout the process, interrogate validity like your reputation depends on it---because, honestly, it should.

(4/6)
hannawallach.bsky.social
Here's our hot take: evaluating GenAI systems isn't just some techie puzzle---it's a social science measurement challenge.

(3/6)
hannawallach.bsky.social
But there's a dirty little secret: the ways we evaluate GenAI systems are often sloppy, vague, and quite frankly... not up to the task.

(2/6)
hannawallach.bsky.social
Alright, people, let's be honest: GenAI systems are everywhere, and figuring out whether they're any good is a total mess. Should we use them? Where? How? Do they need a total overhaul?

(1/6)
hannawallach.bsky.social
I'm so excited this paper is finally online!!! 🎉 We had so much fun working on this with @emmharv.bsky.social!!! Thread below summarizing our contributions...
emmharv.bsky.social
📣 "Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems" is forthcoming at #ACL2025NLP - and you can read it now on arXiv!

🔗: arxiv.org/pdf/2506.04482
🧵: ⬇️
A screenshot of our paper: 

Title: Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach

Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments---even useful instruments---are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
hannawallach.bsky.social
Please spread the word to anyone who you think might be interested! We will begin reviewing applications on June 2.
hannawallach.bsky.social
This program is open to candidates who will have completed their bachelor's degree (or equiv.) by Summer 2025 (inc. those who graduated previously and have been working or doing a master's degree) and who want to advance their research skills before applying to PhD programs.
hannawallach.bsky.social
Exciting news!!! This just got into @icmlconf.bsky.social as a position paper!!! 🎉 More updates to come as we work on the camera-ready version!!!
hannawallach.bsky.social
Remember this @neuripsconf.bsky.social workshop paper? We spent the past month writing a newer, better, longer version!!! You can find it online here: arxiv.org/abs/2502.00561
hannawallach.bsky.social
Thank you for posting! Very timely as the paper just got accepted to ICML's position paper track!

Reposted by Hanna Wallach

amabalayn.bsky.social
At the #HEAL workshop, I'll present "Systematizing During Measurement Enables Broader Stakeholder Participation" on the ways we can further structure LLM evaluations and open them for deliberation. A project led by @hannawallach.bsky.social

Reposted by Hanna Wallach

jennwv.bsky.social
2. Also Saturday, @amabalayn.bsky.social will represent our piece arguing that systematization during measurement enables broad stakeholder participation in AI evaluation.

This came out of a huge group collaboration led by @hannawallach.bsky.social: bsky.app/profile/hann...

heal-workshop.github.io

Reposted by Hanna Wallach

katekaye.bsky.social
Reading "Evaluating Evaluations for GenAI" from @hannawallach.bsky.social, madesai.bsky.social, afedercooper.bsky.social, et al. This work dovetails with our work at @worldprivacyforum.bsky.social on measuring AI governance tools from governments, through a privacy/policy lens: arxiv.org/pdf/2502.00561
