andrea wang
@andreawwenyi.bsky.social
1.9K followers 56 following 15 posts
phd @ cornell infosci https://andreawwenyi.github.io
Reposted by andrea wang
jennahgosciak.bsky.social
I am presenting a new 📝 “Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments” at @facct.bsky.social on Thursday, with @aparnabee.bsky.social, Derek Ouyang, @allisonkoe.bsky.social, @marzyehghassemi.bsky.social, and Dan Ho. 🔗: arxiv.org/abs/2506.13735
(1/n)
"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments"

Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on demographic information, their efficacy is limited by the availability and consistency of demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action -- and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in the rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays affect disparity assessments, and hence the conclusions drawn, across a range of consequential healthcare outcomes, particularly at the more granular state and practice levels. Third, we find that conventional methods for imputing missing race have limited ability to mitigate the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.

Figure contrasting a conventional, static approach to conducting disparity assessments with the analysis we conduct in this paper. Our analysis (1) uses comprehensive health data from over 1,000 primary care practices and 5 million patients across the U.S., (2) timestamped information on the reporting of race to measure delay, and (3) retrospective analyses of disparity assessments under varying levels of delay.
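A minimal sketch of the point-in-time comparison the abstract describes, using toy data and hypothetical column names (race, race_reported_date, outcome); this is not the authors' code, only an illustration of how records whose race is reported after the assessment date drop out of a timely disparity estimate.

```python
# Toy illustration (not the paper's code): race reporting delays can change a
# point-in-time disparity assessment. Column names are hypothetical.
import pandas as pd

def rate_gap(df: pd.DataFrame, group_a: str, group_b: str) -> float:
    """Difference in mean outcome rate between two demographic groups."""
    rates = df.groupby("race")["outcome"].mean()
    return rates.get(group_a, float("nan")) - rates.get(group_b, float("nan"))

# Toy records: race for some patients is only reported after the assessment date.
df = pd.DataFrame({
    "race":               ["white", "white", "black", "black", "black"],
    "race_reported_date": pd.to_datetime(
        ["2023-01-05", "2023-02-01", "2023-01-10", "2023-09-01", "2023-10-15"]),
    "outcome":            [1, 0, 1, 0, 0],
})

assessment_date = pd.Timestamp("2023-06-30")
timely = rate_gap(df[df["race_reported_date"] <= assessment_date], "white", "black")
eventual = rate_gap(df, "white", "black")
print(f"gap seen at assessment time: {timely:+.2f}; gap after all reports arrive: {eventual:+.2f}")
```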
Reposted by andrea wang
emmharv.bsky.social
I am so excited to be in 🇬🇷Athens🇬🇷 to present "A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms" by me, @kizilcec.bsky.social, and @allisonkoe.bsky.social, at #FAccT2025!!

🔗: arxiv.org/pdf/2506.04419
A screenshot of our paper's title, authors, and abstract:

Title: A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms
Authors: Emma Harvey, Rene Kizilcec, Allison Koenecke
Abstract: Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects.
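A minimal sketch of the query-access audit pattern the abstract describes, under stated assumptions: `query_chatbot` and `quality_score` are hypothetical stand-ins (not from the paper) for the black-box chatbot and a quality-of-service rubric, and the toy usage at the end is for illustration only.

```python
# Sketch of a query-access dialect audit: compare response quality across paired
# prompts written in a standard vs. a minoritized dialect.
from statistics import mean
from typing import Callable

def audit_dialect_gap(
    paired_prompts: list[tuple[str, str]],   # (standard, minoritized) prompt pairs
    query_chatbot: Callable[[str], str],     # black-box query access only
    quality_score: Callable[[str], float],   # e.g. a 0-1 helpfulness rating
) -> float:
    """Mean quality gap; positive means lower quality for the minoritized dialect."""
    gaps = [
        quality_score(query_chatbot(standard)) - quality_score(query_chatbot(minoritized))
        for standard, minoritized in paired_prompts
    ]
    return mean(gaps)

# Toy stand-ins; replace with real chatbot calls and a real quality rubric.
toy_gap = audit_dialect_gap(
    [("Where is my package?", "Where my package at?")],
    query_chatbot=lambda p: f"echo: {p}",
    quality_score=lambda r: len(r) / 100,
)
```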
Reposted by andrea wang
johngmarks.com
Worth noting today that the entire budget of the NEH is about $200M.
emptywheel.bsky.social
According to acting DOD Comptroller Bryn McDonnell it'll cost $134M for the deployment of the Guard to Los Angeles.
Reposted by andrea wang
lucy3.bsky.social
I'm joining Wisconsin CS as an assistant professor in fall 2026!! There, I'll continue working on language models, computational social science, & responsible AI. 🌲🧀🚣🏻‍♀️ Apply to be my PhD student!

Before then, I'll postdoc for a year in the NLP group at another UW 🏔️ in the Pacific Northwest
Wisconsin-Madison's tree-filled campus, next to a big shiny lake.
A computer render of the interior of the new computer science, information science, and statistics building. A staircase crosses an open atrium with visibility across multiple floors.
Reposted by andrea wang
mariaa.bsky.social
Slightly paraphrasing @oms279.bsky.social during his talk at #COMPTEXT2025:

"The single most important use case for LLMs in sociology is turning unstructured data into structured data."

Discussing his recent work on codebooks, prompts, and information extraction: osf.io/preprints/so...
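A minimal sketch of the unstructured-to-structured pattern mentioned above; the codebook keys and the `call_llm` stub are hypothetical (not from the talk or preprint), and a real pipeline would add validation and retries.

```python
# Sketch: use an LLM to turn unstructured text into structured records.
import json

PROMPT = """You are coding sociological field notes.
Extract a JSON object with keys: "actor", "action", "location", "date".
Use null for anything not stated. Return only JSON.

Text: {text}"""

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError("wire this to your model API of choice")

def extract_record(text: str) -> dict:
    raw = call_llm(PROMPT.format(text=text))
    return json.loads(raw)   # in practice: validate keys and retry on parse errors
```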
Reposted by andrea wang
simonaliao.bsky.social
Hi everyone, I am excited to share our large-scale survey study of 800+ researchers, which reveals how researchers use and perceive LLMs as research tools, and how that usage and those perceptions differ across demographics.

See results in comments!

🔗 Arxiv link: arxiv.org/abs/2411.05025
LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions
The rise of large language models (LLMs) has led many researchers to consider their usage for scientific work. Some have found benefits using LLMs to augment or automate aspects of their research pipe...
arxiv.org
andreawwenyi.bsky.social
China is a nation with over a hundred minority languages and many ethnic groups. What does this say about China’s 21st century AI policy?
andreawwenyi.bsky.social
This suggests a break from China’s past stance of using inclusive language policy as a way to build a multiethnic nation. We see no evidence of socio-political pressure or carrots for Chinese AI groups to dedicate resources for linguistic inclusivity.
andreawwenyi.bsky.social
In fact, many LLMs from China fail to even recognize some lower resource Chinese languages such as Uyghur.
andreawwenyi.bsky.social
LLMs from China are highly correlated with Western LLMs in multilingual performance (0.93–0.99) on tasks such as reading comprehension.
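For concreteness, a toy sketch of the statistic being described: a Pearson correlation between two models' per-language benchmark scores. The language codes and score values below are made up for illustration; they are not the paper's numbers.

```python
# Toy example: correlation of per-language performance between two models.
import numpy as np

languages   = ["cmn", "yue", "ug", "bo", "fr", "en"]            # illustrative only
chinese_llm = np.array([0.88, 0.55, 0.30, 0.28, 0.82, 0.90])    # made-up scores
western_llm = np.array([0.85, 0.52, 0.33, 0.30, 0.84, 0.92])    # made-up scores

r = np.corrcoef(chinese_llm, western_llm)[0, 1]  # Pearson correlation
print(f"per-language performance correlation: {r:.2f}")
```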
andreawwenyi.bsky.social
[New preprint!] Do Chinese AI Models Speak Chinese Languages? Not really. Chinese LLMs like DeepSeek are better at French than Cantonese. Joint work with Unso Jo and @dmimno.bsky.social. Link to paper: arxiv.org/pdf/2504.00289
🧵
Reposted by andrea wang
sungkim.bsky.social
You’ve probably heard about how AI/LLMs can solve Math Olympiad problems (deepmind.google/discover/blo...).

So naturally, some people put it to the test — hours after the 2025 US Math Olympiad problems were released.

The result: They all sucked!
Reposted by andrea wang
travislloydphd.bsky.social
*NEW DATASET AND PAPER* (CHI2025): How are online communities responding to AI-generated content (AIGC)? We study this by collecting and analyzing the public rules of 300,000+ subreddits in 2023 and 2024. 1/
Reposted by andrea wang
cfiesler.bsky.social
hey it's that time of year again, when people start to wonder whether AIES is actually happening and when this year’s paper deadline might be if so! anyone know anything about the ACM/AAAI conference on AI Ethics & Society for 2025?

(I used to ask about this every year on Twitter haha.)
Reposted by andrea wang
dmimno.bsky.social
Best Student Paper at #AIES 2024 went to @andreawwenyi.bsky.social! Annotating gender-biased narratives in the courtroom is a complex, nuanced task with frequent subjective decision-making by legal experts. We asked: What do experts desire from a language model in this annotation process?
andreawwenyi.bsky.social
Lots of exciting open questions from this work, e.g. 1) The effect of pre-training and model architectures on representations of languages and 2) The applications of cross-lingual representations embedded in language models.
andreawwenyi.bsky.social
Embedding geometries are similar across model families and scales, as measured by canonical angles. XLM-R models are extremely similar to each other, as are mT5-small and mT5-base. All models are far from random (0.14–0.27).
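A rough sketch, under assumptions, of one way to compare embedding geometries with canonical (principal) angles: take embedding matrices whose rows are aligned to the same tokens, reduce each to a top singular subspace, and measure the angles between them. The exact procedure and summary statistic in the paper may differ; the data below is synthetic.

```python
# Sketch: canonical angles between the token-space subspaces of two embedding matrices.
import numpy as np
from scipy.linalg import subspace_angles

def token_subspace(emb: np.ndarray, k: int = 50) -> np.ndarray:
    """Orthonormal basis (vocab_size x k) spanned by the top-k left singular vectors."""
    u, _, _ = np.linalg.svd(emb - emb.mean(axis=0), full_matrices=False)
    return u[:, :k]

# Synthetic stand-ins for two models' embeddings over an aligned vocabulary;
# the embedding dimensions need not match, only the token rows do.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 256))
emb_b = emb_a @ rng.normal(size=(256, 128)) * 0.1 + rng.normal(size=(1000, 128))

angles = subspace_angles(token_subspace(emb_a), token_subspace(emb_b))
print(f"mean cosine of canonical angles: {np.cos(angles).mean():.2f}")  # 1.0 = identical geometry
```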
andreawwenyi.bsky.social
The diversity of neighborhoods in mT5 varies by category. For tokens in the two Japanese writing systems, KATAKANA, used for words of foreign origin, has more diverse neighbors than HIRAGANA, used for native Japanese words.
andreawwenyi.bsky.social
The nearest neighbors of mT5 tokens are often translations. NLP spent 10 years trying to make word embeddings align across languages. mT5 embeddings find cross-lingual semantic alignment without even being asked!
andreawwenyi.bsky.social
mT5 embedding neighborhoods are more linguistically diverse: the 50 nearest neighbors of a token represent an average of 7.61 writing systems, compared to 1.64 for XLM-R embeddings.
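A small sketch of how one might count writing systems among a token's nearest neighbors, assuming row-aligned token strings and embeddings; the script labeling here is a crude Unicode-name heuristic, not necessarily the paper's method.

```python
# Sketch: number of distinct writing systems among a token's nearest neighbors.
import numpy as np
import unicodedata

def script_of(token: str) -> str:
    """Crude writing-system label: first word of the Unicode name of the first letter."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"

def neighbor_script_count(embeddings: np.ndarray, tokens: list[str], query_idx: int, k: int = 50) -> int:
    """Distinct scripts among the k nearest neighbors by cosine similarity."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb[query_idx]
    neighbors = np.argsort(-sims)[1:k + 1]          # skip the query token itself
    return len({script_of(tokens[i]) for i in neighbors})

# usage (hypothetical): neighbor_script_count(model_embeddings, vocab_tokens, query_idx=1234)
```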
andreawwenyi.bsky.social
Tokens in different writing systems can be linearly separated with an average accuracy of 99.2% for XLM-R. Even in high-dimensional space, mT5 embeddings are less separable.
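And a sketch, under assumptions, of the linear-separability measurement: fit a linear classifier to predict a token's writing system from its embedding and report held-out accuracy. The data below is synthetic and the classifier choice is mine, not necessarily the paper's.

```python
# Sketch: linear separability of writing systems in an embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def script_separability(embeddings: np.ndarray, script_labels: np.ndarray) -> float:
    """Mean held-out accuracy of a linear classifier predicting script from embedding."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, script_labels, cv=5).mean()

# Illustrative stand-in data: two 'scripts' offset in embedding space.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (200, 64)), rng.normal(2, 1, (200, 64))])
labels = np.array([0] * 200 + [1] * 200)
print(f"linear separability: {script_separability(emb, labels):.3f}")
```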