Lightnews — Scholar-powered news

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

There’s a lot more detail in the full paper, and I would love to hear your thoughts and feedback on it!

Check out the preprint here: arxiv.org/pdf/2506.10150

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

Huge thanks to my amazing collaborators: Fai Poungpeth, @diyiyang.bsky.social, Erina Farrell, @brucelambert.bsky.social, and @mattgroh.bsky.social 🙌

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

LLMs, when benchmarked against reliable expert judgments, can be reliable tools for overseeing emotionally sensitive AI applications.

Our results show we can use LLMs-as-judge to monitor LLMs-as-companion!

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

For example, in one conversation in our dataset, an expert saw a response as “dismissing” the speaker’s emotions, while a crowdworker interpreted the same response as “validating” them!

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

These misjudgments from crowdworkers have huge implications for AI training and deployment❌

If we use flawed evaluations to train and monitor "empathic" AI, we risk creating systems that propagate a broken standard of what good communication looks like.

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

So why the gap between experts/LLMs and crowds?

Crowdworkers often
- have limited attention
- rely on heuristics like “it’s the thought that counts”, focusing on intentions rather than actual wording
- show systematic rating inflation due to social desirability bias

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

And when experts disagree, LLMs struggle to find a consistent signal too.

Here’s how expert agreement (Krippendorff's alpha) varied across empathy sub-components:
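For readers unfamiliar with the statistic, here is a minimal sketch of how Krippendorff's alpha can be computed for nominal labels. It is not the paper's code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per rated item, `None` for a missing rating), and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that different raters gave to one item; None marks a missing
    rating.
    """
    # Drop missing ratings, then keep only items with at least two ratings,
    # because a single rating cannot be paired with anything.
    cleaned = [[v for v in u if v is not None] for u in units]
    cleaned = [u for u in cleaned if len(u) >= 2]
    n = sum(len(u) for u in cleaned)  # total number of pairable ratings
    if n == 0:
        raise ValueError("need at least one item with two or more ratings")

    # Coincidence matrix: every ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidences = Counter()
    for u in cleaned:
        m = len(u)
        for i, j in permutations(range(m), 2):
            coincidences[(u[i], u[j])] += 1.0 / (m - 1)

    # Marginal totals per label.
    totals = Counter()
    for (c, _k), w in coincidences.items():
        totals[c] += w

    # Observed and expected disagreement (the shared 1/n factor cancels).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used, so alpha is undefined.
        return float("nan")
    return 1.0 - observed / expected


# Toy example: three items, two raters each.
ratings = [["validating", "validating"],
           ["validating", "dismissing"],
           ["dismissing", "dismissing"]]
print(round(krippendorff_alpha_nominal(ratings), 3))
```

An alpha of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values indicate systematic disagreement.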

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

But here’s the catch: LLMs are reliable when experts are reliable.

The reliability of expert judgments depends on the clarity of the construct. For nuanced, subjective components of empathic communication, experts often disagree.

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

We analyzed thousands of annotations from LLMs, crowdworkers, and experts on 200 real-world conversations

And specifically looked at 21 sub-components of empathic communication from 4 evaluative frameworks

The result? LLMs consistently matched expert judgments better than crowdworkers did! 🔥

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Jun 17

How do we reliably judge if AI companions are performing well on subjective, context-dependent, and deeply human tasks? 🤖

Excited to share the first paper from my postdoc (!!) investigating when LLMs are reliable judges - with empathic communication as a case study 🧐

🧵👇

aakriti1kumar.bsky.social @aakriti1kumar.bsky.social · Apr 2

Super cool opportunity to work with brilliant scientists and fantastic mentors @mattgroh.bsky.social and Dashun Wang 🌟🌟

Feel free to reach out!

Matt Groh @mattgroh.bsky.social · Apr 2

📣 📣 Postdoc Opportunity at Northwestern

Dashun Wang and I are seeking a creative, technical, interdisciplinary researcher for a joint postdoc fellowship between our labs.

If you're passionate about Human-AI Collaboration and Science of Science, this may be for you! 🚀

Please share widely!

Reposted

Abhishek Sharma @abhishekshar.bsky.social · Jan 23

Our paper: Decision-Point Guided Safe Policy Improvement
We show that a simple approach to learning safe RL policies can outperform most offline RL methods. (+theoretical guarantees!)

How? Just allow the state-actions that have been seen enough times! 🤯

arxiv.org/abs/2410.09361

Decision-Point Guided Safe Policy Improvement

Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challeng...
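The "seen enough times" idea from the post can be illustrated with a toy sketch. This is not the paper's algorithm: the function name `count_filtered_policy`, the deterministic behavior policy, the use of averaged observed returns as value estimates, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def count_filtered_policy(transitions, behavior_action, min_count):
    """Toy count-threshold policy improvement (illustrative only).

    transitions: iterable of (state, action, observed_return) tuples.
    behavior_action: dict mapping each state to the behavior policy's action
        (assumed deterministic here for simplicity).
    min_count: how often a state-action pair must appear in the data before
        the new policy is allowed to choose it.
    """
    counts = defaultdict(int)
    return_sums = defaultdict(float)
    for state, action, ret in transitions:
        counts[(state, action)] += 1
        return_sums[(state, action)] += ret

    policy = {}
    for state, default_action in behavior_action.items():
        # Average return of every action seen at least min_count times here.
        candidates = {
            a: return_sums[(s, a)] / counts[(s, a)]
            for (s, a) in counts
            if s == state and counts[(s, a)] >= min_count
        }
        if candidates:
            # Pick the best action among the well-supported ones.
            policy[state] = max(candidates, key=candidates.get)
        else:
            # Not enough data in this state: fall back to the behavior policy.
            policy[state] = default_action
    return policy


# Hypothetical data: "right" looks better in s0 but was only seen once.
data = [("s0", "left", 1.0)] * 5 + [("s0", "right", 10.0)] + [("s1", "left", 0.0)]
behavior = {"s0": "left", "s1": "left"}
print(count_filtered_policy(data, behavior, min_count=3))
```

With a threshold of 3, the rarely observed "right" action is ignored even though its single observed return is higher, which is the conservative behavior the post describes.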
