Manya Wadhwa
@manyawadhwa.bsky.social
180 followers 170 following 17 posts
PhD at UTCS | #NLP https://manyawadhwa.github.io/
Pinned
manyawadhwa.bsky.social
Evaluating language model responses on open-ended tasks is hard! 🤔

We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️.

EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
Reposted by Manya Wadhwa
kmahowald.bsky.social
UT Austin Linguistics is hiring in computational linguistics!

Asst or Assoc.

We have a thriving group (sites.utexas.edu/compling/) and a long, proud history in the space. (For instance, fun fact, Jeff Elman was a UT Austin Linguistics Ph.D.)

faculty.utexas.edu/career/170793

🤘
UT Austin Computational Linguistics Research Group – Humans processing computers processing humans processing language
sites.utexas.edu
Reposted by Manya Wadhwa
gregdnlp.bsky.social
Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social's papers (find them if you missed it!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster
manyawadhwa.bsky.social
Unfortunately I won't be at #COLM2025 this week, but please check out our work being presented by my collaborators/advisors!

If you are interested in evals of open-ended tasks/creativity, please reach out and we can schedule a chat! :)
gregdnlp.bsky.social
Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social's papers (find them if you missed it!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster
Reposted by Manya Wadhwa
juand-r.bsky.social
Excited to present this at #COLM2025 tomorrow! (Tuesday, 11:00 AM poster session)
juand-r.bsky.social
One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect.

🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others!

🧵👇
A visualization of the generator-validator gap, where the LM likelihoods for the generator and discriminator forms of questions are poorly correlated. Aligning the validator and generator rankings can fix it!
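A minimal sketch of the idea in the figure (illustrative only, not the paper's code): quantify the gap as the (lack of) rank correlation between the scores an LM assigns to candidate answers in generator form versus validator form. The log-probabilities below are made-up numbers.

```python
# Illustrative sketch, not the paper's code: the generator-validator gap as
# poor rank correlation between generator-form and validator-form scores.
from scipy.stats import spearmanr

# Made-up log-probabilities for four candidate answers to one question.
generator_logprobs = [-1.2, -2.5, -0.8, -3.1]  # log P(answer | question)
validator_logprobs = [-2.0, -0.9, -1.5, -0.4]  # log P("Yes" | "Is this answer correct?")

rho, _ = spearmanr(generator_logprobs, validator_logprobs)
print(f"rank correlation = {rho:.2f}")  # low or negative => large gap
```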
Reposted by Manya Wadhwa
markar.bsky.social
Come talk with us today about the evaluation of long-form multilingual generation at the second #COLM2025 poster session

📍4:30–6:30 PM / Room 710 – Poster #8
Reposted by Manya Wadhwa
cmalaviya.bsky.social
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses?

Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
Reposted by Manya Wadhwa
esteng.bsky.social
Extremely excited to announce that I will be joining
@utaustin.bsky.social Computer Science in August 2025 as an Assistant Professor! 🎉
UT Austin campus
Reposted by Manya Wadhwa
vishakhpk.bsky.social
What does it mean for #LLM output to be novel?
In work w/ johnchen6.bsky.social, Jane Pan, Valerie Chen and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Reposted by Manya Wadhwa
juand-r.bsky.social
How do language models organize concepts and their properties? Do they use taxonomies to infer new properties, or infer based on concept similarities? Apparently, both!

🌟 New paper with my fantastic collaborators @amuuueller.bsky.social and @kanishka.bsky.social
Title: "Characterizing the Role of Similarity in the Property Inferences of Language Models"
Authors: Juan Diego Rodriguez, Aaron Mueller, Kanishka Misra

Left figure: "Given that dogs are daxable, is it true that corgis are daxable?" A language model could answer this either using taxonomic relations, illustrated by a taxonomy dog-corgi, dog-mutt, canine-wolf, etc., or by similarity relations (dogs are more similar to corgis than cats, wolves or shar peis).

Right figure: illustration of the causal model (and an example intervention) for distributed alignment search (DAS), which we used to find a subspace in the network responsible for property inheritance behavior. The bottom nodes are "property", "premise concept (A)" and "conclusion concept (B)", the middle nodes are "A has property P", "B is a kind of A", and the top node is "B has property P".
Reposted by Manya Wadhwa
kanishka.bsky.social
If you are at #NAACL2025 @naaclmeeting.bsky.social catch @juand-r.bsky.social presenting our poster on the interplay between similarity and category membership in the property inferences of LMs @ Poster Session 1 on Wednesday!

Or if you're at home like me, read our paper: arxiv.org/abs/2410.22590
juand-r.bsky.social
How do language models organize concepts and their properties? Do they use taxonomies to infer new properties, or infer based on concept similarities? Apparently, both!

🌟 New paper with my fantastic collaborators @amuuueller.bsky.social and @kanishka.bsky.social
Title: "Characterizing the Role of Similarity in the Property Inferences of Language Models"
Authors: Juan Diego Rodriguez, Aaron Mueller, Kanishka Misra

Left figure: "Given that dogs are daxable, is it true that corgis are daxable?" A language model could answer this either using taxonomic relations, illustrated by a taxonomy dog-corgi, dog-mutt, canine-wolf, etc., or by similarity relations (dogs are more similar to corgis than cats, wolves or shar peis).

Right figure: illustration of the causal model (and an example intervention) for distributed alignment search (DAS), which we used to find a subspace in the network responsible for property inheritance behavior. The bottom nodes are "property", "premise concept (A)" and "conclusion concept (B)", the middle nodes are "A has property P", "B is a kind of A", and the top node is "B has property P".
Reposted by Manya Wadhwa
anirudhkhatry.bsky.social
🚀Meet CRUST-Bench, a dataset for C-to-Rust transpilation of full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
manyawadhwa.bsky.social
📝 Read the full paper: arxiv.org/pdf/2504.15219
💻 You can also use our system to generate criteria: github.com/ManyaWadhwa/...
Also check out our 🎛️ UI to explore generated criteria + source URLs!
manyawadhwa.bsky.social
Why do we need this? If you’ve used an LLM to draft a paper intro, research talk, or blog post, you’ve likely noticed that while the facts are correct, something feels off. What might be missing are the subtle cues and unspoken expectations. EvalAgent helps uncover and address those hidden layers! 🔮
manyawadhwa.bsky.social
EvalAgent (EA-Web) criteria are often non-obvious to humans and not easily met by LLMs out of the box, making them valuable for evaluation. We also show that the criteria generated by EvalAgent are highly actionable (results in paper)!
manyawadhwa.bsky.social
We test criteria generated by EvalAgent across 9 datasets, from creative writing to technical reports, and compare against criteria generated by 2 other systems!

Results? We show that the criteria generated by EvalAgent (EA-Web) are 🎯 highly specific and 💭 implicit.
manyawadhwa.bsky.social
For example, EvalAgent generates the following criteria for the academic talk prompt:

The response should have:
🪄 A compelling opening/motivation
🧠 A clear research question that it answers
🏁 A strong conclusion that restates findings
manyawadhwa.bsky.social
EvalAgent emulates how a human would seek advice: 🔍 searching for things like “how to write a compelling talk”, reading expert tips from blogs and academic websites, and aggregating them into specific, useful evaluation criteria.
manyawadhwa.bsky.social
Take the prompt "Help me draft an academic talk on coffee intake vs research productivity." We know the output should be factual. But how do we identify less obvious features that are not in the prompt, like the structure of the talk? That’s where EvalAgent steps in!
manyawadhwa.bsky.social
EvalAgent uncovers criteria for evaluating LLM responses on open-ended tasks by:

📌 Decomposing the user prompt into key conceptual queries
🌐 Searching the web for expert advice and summarizing it
📋 Aggregating web-retrieved information into specific and actionable evaluation criteria
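A rough sketch of that three-stage loop (illustrative pseudocode, not the actual EvalAgent implementation; `llm` and `web_search` are hypothetical callables standing in for a language model and a search API):

```python
# Illustrative sketch of the decompose -> search -> aggregate loop described above.
# Not the actual EvalAgent code; `llm(prompt) -> str` and `web_search(query) -> list[str]`
# are hypothetical stand-ins for a language model and a web-search API.
def evalagent_sketch(user_prompt: str, llm, web_search, n_queries: int = 5) -> str:
    # 1) Decompose the user prompt into key conceptual queries.
    queries = llm(
        f"List {n_queries} web-search queries asking for expert advice on how to "
        f"do this task well:\n{user_prompt}"
    ).splitlines()

    # 2) Search the web for expert advice and summarize each retrieved page.
    advice = []
    for query in queries:
        for page in web_search(query):
            advice.append(llm(f"Summarize the advice in this page relevant to '{query}':\n\n{page}"))

    # 3) Aggregate web-retrieved advice into specific, actionable evaluation criteria.
    return llm(
        "Aggregate the following advice into a deduplicated list of specific, "
        "checkable evaluation criteria for the original task:\n" + "\n".join(advice)
    )
```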
manyawadhwa.bsky.social
Evaluating language model responses on open-ended tasks is hard! 🤔

We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️.

EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
Reposted by Manya Wadhwa
juand-r.bsky.social
One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect.

🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others!

🧵👇
A visualization of the generator-validator gap, where the LM likelihoods for the generator and discriminator forms of questions are poorly correlated. Aligning the validator and generator rankings can fix it!
Reposted by Manya Wadhwa
juand-r.bsky.social
1.) [NAACL 25] @kanishka.bsky.social, @amuuueller.bsky.social and I delve into how language models do property inheritance using behavioral and mechanistic analyses.
Thank you, Kanishka and Aaron. I could not have hoped for better collaborators! arxiv.org/abs/2410.22590

[👇 bsky.app/profile/juan...
Reposted by Manya Wadhwa
victorwang37.bsky.social
LLM judges have become ubiquitous, but valuable signal is often ignored at inference.

We analyze design decisions for leveraging judgment distributions from LLM-as-a-judge: 🧵

(w/ Michael J.Q. Zhang, @eunsol.bsky.social)
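One common way to use the judge's full judgment distribution rather than a single sampled verdict (a sketch of the general idea, not necessarily the design choices the paper analyzes): read the judge's token probabilities over the rating scale and take an expectation.

```python
# Sketch only: take the expected rating under the judge's distribution over
# rating tokens, instead of keeping just the argmax/sampled rating.
# The log-probabilities below are made-up; in practice they would come from
# the judge model's logprobs over the tokens "1".."5".
import math

rating_logprobs = {"1": -4.1, "2": -2.7, "3": -1.3, "4": -0.6, "5": -2.0}

probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
z = sum(probs.values())
expected_rating = sum(int(r) * p / z for r, p in probs.items())

print("argmax rating:", max(probs, key=probs.get))  # discards distributional signal
print(f"expected rating: {expected_rating:.2f}")    # uses the full distribution
```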
Reposted by Manya Wadhwa
jessyjli.bsky.social
Do you want to know what information LLMs prioritize in text synthesis tasks? Here's a short 🧵 about our new paper, led by Jan Trienes: an interpretable framework for salience analysis in LLMs.

First of all, information salience is a fuzzy concept. So how can we even measure it? (1/6)