Lightnews — Scholar-powered news

Reposted by Xuhui Zhou

Language Technologies Institute | CMU @ltiatcmu.bsky.social · Jun 26

New research from LTI, UMich, & Allen Institute for AI: LLMs don’t just hallucinate – sometimes, they lie. When truthfulness clashes with utility (pleasing users, boosting brands), models often mislead. @nlpxuhui.bsky.social and @maartensap.bsky.social discuss the paper:
lti.cmu.edu/news-and-eve...

Does Your Chatbot Swear to Tell the Truth? - Language Technologies Institute - School of Computer Science - Carnegie Mellon University

New research finds that LLM-based agents can't always be trusted to be truthful

lti.cmu.edu

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

Wonderful collaborations with Zhe Su, Anubha Kabra, Sanketh Rangreji, @jmendelsohn2.bsky.social, @faeze_brh, @maartensap.bsky.social

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

Check out our paper to learn more about how LLMs navigate these ethical dilemmas: arxiv.org/abs/2409.09013 7/

#AI #MachineLearning #AIEthics #LLMs #nlp #NLProc #NAACL2025

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

To be safely and successfully deployed, LLMs must simultaneously satisfy truthfulness and utility goals. Yet, often these two goals compete (e.g., an AI agent assisting a used car salesman selling a c...

arxiv.org

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

🔄 Multi-turn interactive setup is crucial - models often begin with equivocation but shift to falsification when pressed for clear answers 🧠 Stronger models like GPT-4o showed the greatest shift when prompted to deceive (40% increase in falsification; alarming) 6/

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

⚠️ Even when explicitly instructed to be truthful, models STILL lied - GPT-4o still falsified info 15% of the time! 📉 The tradeoff is real: more honest models completed their goals 15% less often 5/

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

💼 In business scenarios (selling defective products), models were either completely honest OR completely deceptive 🌐 In public image scenarios (reputation management), behaviors were more ambiguous and complex 4/

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

And what we found: 📊 ALL tested models (GPT-4o, LLaMA-3, Mixtral) were truthful less than 50% of the time in conflict scenarios 🤔 Models prefer "partial lies" like equivocation over outright falsification - they'll dodge questions before explicitly lying 3/

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

Obviously this is a pressing issue now: x.com/deedydas/sta...; x.com/DanHendrycks... Here, we put LLMs into a multi-turn dialogue environment that mimics the realistic setting where users repeatedly try to seek info from LLMs 2/

Xuhui Zhou @nlpxuhui.bsky.social · Apr 28

When interacting with ChatGPT, have you wondered if it would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/

Reposted by Xuhui Zhou

Joel Mire @joelmire.bsky.social · Mar 6

Reward models for LMs are meant to align outputs with human preferences—but do they accidentally encode dialect biases? 🤔

Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉

Paper: arxiv.org/abs/2502.12858 (1/10)

Screenshot of Arxiv paper title, "Rejected Dialects: Biases Against African American Language in Reward Models," and author list: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap.

Reposted by Xuhui Zhou

Hao Zhu 朱昊 @zhuhao.me · Mar 4

We are getting closer to having agents operating in the real physical world. However, can we trust frontier models to make embodied decisions 🎮 aligned with human norms 👩‍⚖️?

With EgoNormia, a 1.8k ego-centric video 🥽 QA benchmark, we show that this is surprisingly challenging!