Divya Shanmugam
@dmshanmugam.bsky.social
120 followers 190 following 27 posts
trying to build reliable models from unreliable data · postdoc @ Cornell Tech · phd @ MIT · dmshanmugam.github.io
dmshanmugam.bsky.social
can't recommend highly enough!
emmapierson.bsky.social
🚨 New postdoc position in our lab at Berkeley EECS! 🚨

(please reshare)

We seek applicants with experience in language modeling who are excited about high-impact applications in the health and social sciences!

More info in thread

1/3
Reposted by Divya Shanmugam
monicaagrawal.bsky.social
Excited to be at #ICML2025 to present our paper on 'pragmatic misalignment' in (deployed!) RAG systems: narrowly "accurate" responses that can be profoundly misinterpreted by readers.

It's especially dangerous for consequential domains like medicine! arxiv.org/pdf/2502.14898
A person searching for risks of surgery. A traditional search engine would surface websites that would likely include both pros and cons of the surgery. However, RAG results only excerpt the cons.
Reposted by Divya Shanmugam
reniebird.bsky.social
I'll be presenting a position paper about consumer protection and AI in the US at ICML. I have a surprisingly optimistic take: our legal structures are stronger than I anticipated when I went to work on this issue in Congress.

Is everything broken rn? Yes. Will it stay broken? That's on us.
A poster for the paper "Position: Strong Consumer Protection is an Inalienable Defense for AI Safety in the United States"
Reposted by Divya Shanmugam
allisonkoe.bsky.social
🎉Excited to present our paper tomorrow at @facct.bsky.social, “Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese”, with @brucelyu17.bsky.social, Jiebo Luo and Jian Kang, revealing 🤖 LLM performance disparities. 📄 Link: arxiv.org/abs/2505.22645
"Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese" Abstract:

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (this https URL).

Figure: three different LLMs (GPT-4o, Qwen-1.5, and Taiwan-LLM) may answer a prompt about pineapples differently when asked in Simplified Chinese vs. Traditional Chinese.
Figure: LLMs disproportionately answer questions about region-specific terms (like the word for "pineapple," which differs in Simplified and Traditional Chinese) correctly when prompted in Simplified Chinese as opposed to Traditional Chinese.
Figure: LLMs vary widely in how well they adhere to prompt instructions, favoring Traditional Chinese names over Simplified Chinese names in a benchmark task regarding hiring.
Reposted by Divya Shanmugam
shaily99.bsky.social
🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗

📜 arxiv.org/abs/2506.00784

[1/11]
An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The overview describes that we survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). We then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).
dmshanmugam.bsky.social
and... here is the actual GIF 🙈
dmshanmugam.bsky.social
it brings me tremendous joy you noticed!!!
dmshanmugam.bsky.social
Last but not least, thanks to Helen Lu, @swamiviv1, and John Guttag, my wonderful collaborators on this work! One of my last from the PhD 🥹
dmshanmugam.bsky.social
Empirically, TTA reduces prediction set sizes by 10-14% on average, with larger improvements for (1) classes with the largest prediction set sizes and (2) stronger coverage guarantees.
dmshanmugam.bsky.social
We also present a new finding that explains TTA's value to conformal prediction: it raises the predicted probability of the true class even when that class is initially ranked as unlikely, which matters for conformal scores that rely on orderings over predicted probabilities (e.g., APS, RAPS)!
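To see why the ordering matters, here is a rough sketch of an APS-style score (my own simplification, not code from the paper): the score sums the predicted probabilities of every class ranked at or above the true class, so pushing the true class up the ranking directly shrinks it.

```python
import numpy as np

def aps_score(probs, label):
    """Simplified APS nonconformity score (randomized tie-breaking omitted).

    Sums the predicted probabilities of all classes ranked at or above the
    true class. Because the score depends on the ordering of probabilities,
    promoting the true class up the ranking (as TTA tends to do) lowers it.
    """
    order = np.argsort(-probs)                  # classes from most to least likely
    rank = int(np.where(order == label)[0][0])  # position of the true class
    return probs[order][: rank + 1].sum()
```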
dmshanmugam.bsky.social
We show that test-time augmentation (TTA)—a classic vision technique—is a simple and surprisingly effective way to shrink sets while maintaining coverage. TTA aggregates predictions over transformations of an input (a neat way to create an ensemble out of a single classifier!)
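For concreteness, a minimal sketch of what TTA looks like for an image classifier, assuming a PyTorch model and torchvision; the flips and rotations here are illustrative choices, not necessarily the augmentation policy used in the paper.

```python
import torch
import torchvision.transforms.functional as TF

def tta_probs(model, image, angles=(-10, 0, 10)):
    """Average softmax outputs over flipped/rotated views of one image.

    `image` is a (C, H, W) tensor; the particular set of views is illustrative.
    """
    model.eval()
    views = [TF.rotate(img, angle)
             for img in (image, TF.hflip(image))
             for angle in angles]
    batch = torch.stack(views)                  # (n_views, C, H, W)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=-1)
    return probs.mean(dim=0)                    # an ensemble from a single classifier
```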
dmshanmugam.bsky.social
New work 🎉: conformal classifiers return sets of classes for each example, with a probabilistic guarantee the true class is included. But these sets can be too large to be useful.

In our #CVPR2025 paper, we propose a method to make them more compact without sacrificing coverage.
A gif explaining the value of test-time augmentation to conformal classification. It begins with an illustration of TTA reducing the size of the predicted set of classes for a dog image, and goes on to explain that this is because TTA raises the true class's predicted probability, even when that class is predicted to be unlikely.
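To make the setup concrete, here is a minimal split-conformal sketch using the simple 1 - p(true class) score; the paper works with ordering-based scores like APS/RAPS, but the calibration recipe is the same, and the probabilities passed in could be plain softmax outputs or TTA-averaged ones.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with the 1 - p_y nonconformity score (sketch).

    cal_probs:  (n, K) predicted probabilities on a held-out calibration split
    cal_labels: (n,)   true labels for the calibration split
    test_probs: (m, K) predicted probabilities for test examples
    Returns one prediction set (array of class indices) per test example,
    covering the true class with probability >= 1 - alpha.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]       # calibration scores
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    qhat = np.quantile(scores, q_level, method="higher")
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]  # classes under threshold
```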
dmshanmugam.bsky.social
One place you can find me is Poster Session 4 on Saturday, at 5PM, presenting recent work on how you can use test-time augmentation to reduce the size of sets produced by conformal prediction. Full paper thread coming shortly :) here is the paper in the meantime: arxiv.org/abs/2505.22764
Test-time augmentation improves efficiency in conformal prediction
A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers p...
arxiv.org
dmshanmugam.bsky.social
I’m in Nashville this week for #CVPR2025! DM me to chat about conformal prediction, test-time adaptation, or model reliability. Excited to see new work and to catch up with friends old and new!!
Reposted by Divya Shanmugam
jennwv.bsky.social
Please help us spread the word! 📣

FATE is hiring a pre-doc research assistant! We're looking for candidates who will have completed their bachelor's degree (or equivalent) by summer 2025 and want to advance their research skills before applying to PhD programs.
Reposted by Divya Shanmugam
ericachiang.bsky.social
I really enjoyed (and learned a LOT from) working on this project with these wonderful co-authors:
@dmshanmugam.bsky.social
Ashley Beecy
Gabriel Sayer
@destrin.bsky.social
@nkgarg.bsky.social
@emmapierson.bsky.social
7/7
dmshanmugam.bsky.social
Erica’s new paper on a method to both measure *and* correct for three types of disparities associated with disease progression is now out! Check out the thread for more detail + findings from a case study on heart failure. Congratulations!!!
ericachiang.bsky.social
I’m really excited to share the first paper of my PhD, “Learning Disease Progression Models That Capture Health Disparities” (accepted at #CHIL2025)! ✨ 1/

📄: arxiv.org/abs/2412.16406
dmshanmugam.bsky.social
my friend jonah made a fun game that i now play every day: guessten.com! please enjoy and send me your scores
GuessTen
guessten.com
dmshanmugam.bsky.social
just used this to source citations with great success - a very nice tool!!
ai2.bsky.social · Mar 26
Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
Screenshot of the Ai2 Paper Finder interface
dmshanmugam.bsky.social
i’ve been wondering this too! thanks for asking
dmshanmugam.bsky.social
kenny had the great idea to spend a whole day analyzing dogs — so so fun! i like health data but turns out i love dog data
kennypeng.bsky.social
Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵

1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.
Reposted by Divya Shanmugam
gsagostini.bsky.social
Migration data lets us study responses to environmental disasters, social change patterns, policy impacts, etc. But public data is too coarse, obscuring these important phenomena!

We build MIGRATE: a dataset of yearly flows between 47 billion pairs of US Census Block Groups. 1/5