Joachim Baumann
@joachimbaumann.bsky.social
130 followers 190 following 15 posts
Postdoc @milanlp.bsky.social / Incoming Postdoc @stanfordnlp.bsky.social / Computational social science, LLMs, algorithmic fairness
Pinned
joachimbaumann.bsky.social
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**.

Paper: arxiv.org/pdf/2509.08825
We present our new preprint titled "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation".
We quantify LLM hacking risk through systematic replication of 37 diverse computational social science annotation tasks.
For these tasks, we use a combined set of 2,361 realistic hypotheses that researchers might test using these annotations.
Then, we collect 13 million LLM annotations across plausible LLM configurations.
These annotations feed into 1.4 million regressions testing the hypotheses. 
For a hypothesis with no true effect (ground truth p > 0.05), different LLM configurations yield conflicting conclusions.
Checkmarks indicate correct statistical conclusions matching ground truth; crosses indicate LLM hacking -- incorrect conclusions due to annotation errors.
Across all experiments, LLM hacking occurs in 31-50% of cases even with highly capable models.
Since minor configuration changes can flip scientific conclusions from correct to incorrect, LLM hacking can be exploited to present anything as statistically significant.
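To make the failure mode concrete, here is a minimal, self-contained simulation (not the paper's code or data; all numbers and names are made up) of how annotation errors that correlate with a predictor can turn a true null result into a "significant" finding:

```python
# Illustrative simulation only -- not the paper's pipeline.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                 # predictor of interest
y_true = rng.binomial(1, 0.3, size=n)  # true labels, independent of x (no true effect)

# Hypothetical LLM annotator that over-predicts the positive class for some texts:
# its false-positive rate is higher whenever x > 0 (purely made up for illustration).
fp_rate = np.where(x > 0, 0.20, 0.05)
false_pos = (y_true == 0) & (rng.random(n) < fp_rate)
y_llm = np.where(false_pos, 1, y_true)

for name, y in [("ground truth", y_true), ("LLM labels ", y_llm)]:
    res = sm.OLS(y, sm.add_constant(x)).fit()   # simple linear probability model
    print(f"{name}: coef={res.params[1]:+.3f}, p={res.pvalues[1]:.4f}")
# Typically: the ground-truth labels give p > 0.05 (correct null), while the LLM
# labels give p < 0.05 -- a spurious effect created entirely by annotation errors.
```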
Reposted by Joachim Baumann
clairegillan.bsky.social
Looks interesting! We have been facing this exact issue - finding big inconsistencies across different LLMs rating the same text.
Reposted by Joachim Baumann
scg-uzh.bsky.social
About last week’s internal hackathon 😏
Last week, we -- the (Amazing) Social Computing Group -- held an internal hackathon to work on what we informally call the “Cultural Imperialism” project.
Reposted by Joachim Baumann
jbgruber.bsky.social
If you feel uneasy using LLMs for data annotation, you are right to (and if you don't, you should). LLM annotation opens up research that is difficult with traditional #NLP/#textasdata methods, but the risk of false conclusions is high!

Experiment + *evidence-based* mitigation strategies in this preprint 👇
joachimbaumann.bsky.social
The 94% LLM hacking success rate is achieved by annotating data with several model-prompt configs, then choosing the one that yields the desired result (70% if considering SOTA models only).
The 31-50% risk reflects well-intentioned researchers who just run one reasonable config w/o cherry-picking.
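For intuition, the deliberate variant boils down to a selection loop like the sketch below. Everything here is a stand-in, not the paper's pipeline: `annotate`, `texts`, `x`, and the model and prompt names are hypothetical.

```python
# Hypothetical sketch of the cherry-picking loop -- do not do this in practice.
import itertools
import statsmodels.api as sm

models = ["model-a", "model-b", "model-c"]         # made-up model names
prompts = ["prompt_v1", "prompt_v2", "prompt_v3"]  # made-up prompt variants

def predictor_pvalue(labels, x):
    """p-value of the predictor in a simple linear probability model."""
    return sm.OLS(labels, sm.add_constant(x)).fit().pvalues[1]

picked = None
for model, prompt in itertools.product(models, prompts):
    labels = annotate(texts, model=model, prompt=prompt)  # hypothetical annotation helper
    if predictor_pvalue(labels, x) < 0.05:                # keep the first "significant" config
        picked = (model, prompt)
        break
# With enough model-prompt combinations, some configuration will cross p < 0.05
# even when no true effect exists -- which is why pre-registering the configuration
# and validating it against human labels matters.
```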
joachimbaumann.bsky.social
Thank you, Florian :) We use two methods, CDI and DSL. Both debias LLM annotations and reduce false-positive conclusions to about 3-13% on average, but at the cost of a much higher Type II risk (up to 92%). Conclusions based on human annotations alone have a similarly low Type I risk, but at a lower Type II risk.
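For readers unfamiliar with these corrections, here is a rough sketch of the shared intuition only (not the actual CDI or DSL estimators; see the preprint and its references for those): a small random subsample gets human labels, and the human-vs-LLM discrepancy on that subsample debiases the outcome before the regression is run.

```python
# Rough sketch of the design-based correction idea, not the CDI/DSL implementations.
import numpy as np

def debiased_outcome(y_llm, y_human, is_labeled, sampling_prob):
    """Pseudo-outcome: LLM label plus an inverse-probability-weighted correction
    from the human-labeled subsample (unlabeled rows get zero correction).
    y_human may be NaN where is_labeled is False; those entries are ignored."""
    correction = np.where(is_labeled, (y_human - y_llm) / sampling_prob, 0.0)
    return y_llm + correction

# Running the downstream regression on this pseudo-outcome removes the annotation
# bias on average, but the correction term adds variance -- the Type II cost
# mentioned above.
```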
joachimbaumann.bsky.social
Great question! Performance and LLM hacking risk are negatively correlated. So easy tasks do have lower risk. But even tasks with 96% F1 score showed up to 16% risk of wrong conclusions. Validation is important because high annotation performance doesn't guarantee correct conclusions.
joachimbaumann.bsky.social
We used 199 different prompts total: some from prior work, others based on human annotation guidelines, and some simple semantic paraphrases

Even when LLMs correctly identify significant effects, estimated effect sizes still deviate from true values by 40-77% (see Type M risk, Table 3 and Figure 3)
joachimbaumann.bsky.social
Why this matters: LLM hacking affects any field using AI for data analysis, not just computational social science!

Please check out our preprint, we'd be happy to receive your feedback!

#LLMHacking #SocialScience #ResearchIntegrity #Reproducibility #DataAnnotation #NLP #OpenScience #Statistics
joachimbaumann.bsky.social
The good news: we found solutions that help mitigate this:
✅ Larger, more capable models are safer (but no guarantee).
✅ A few human annotations beat many AI annotations.
✅ Testing several models and configurations on held-out data helps (see the sketch after this list).
✅ Pre-registering AI choices can prevent cherry-picking.
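A minimal sketch of that held-out validation idea (all names here -- `annotate`, the config dicts, the held-out arrays -- are assumptions, not the paper's code): pick the configuration that agrees best with held-out human labels, never the one that yields the preferred p-value.

```python
# Minimal validation sketch with assumed helper names -- not the paper's pipeline.
from sklearn.metrics import f1_score

def pick_config_by_validation(configs, heldout_texts, heldout_human_labels, annotate):
    """Return the config whose annotations agree best with held-out human labels."""
    scored = []
    for config in configs:
        preds = annotate(heldout_texts, **config)  # hypothetical annotation call
        scored.append((f1_score(heldout_human_labels, preds), config))
    return max(scored, key=lambda pair: pair[0])[1]
```

Because the selection criterion is agreement with humans, the choice cannot be steered by the downstream hypothesis test itself.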
joachimbaumann.bsky.social
- Researchers using SOTA models like GPT-4o face a 31-50% chance of false conclusions for plausible hypotheses.
- Risk peaks near significance thresholds (p=0.05), where 70% of "discoveries" may be false.
- Regression correction methods often don't solve the problem, as they trade Type I errors for Type II errors.
joachimbaumann.bsky.social
We tested 18 LLMs on 37 social science annotation tasks (13M labels, 1.4M regressions). By trying different models and prompts, you can make 94% of null results appear statistically significant, or flip findings completely 68% of the time.

Importantly, this also concerns well-intentioned researchers!
joachimbaumann.bsky.social
Not at this point, but the preprint should be ready soon
joachimbaumann.bsky.social
The @milanlp.bsky.social group is presenting 15 papers (+ a tutorial) at this year's #ACL2025, go check them out :)
bsky.app/profile/mila...
milanlp.bsky.social
🎉 The @milanlp.bsky.social lab is excited to present 15 papers and 1 tutorial at #ACL2025 & workshops! Grateful to all our amazing collaborators, see everyone in Vienna! 🚀
joachimbaumann.bsky.social
Shoutout to @tiancheng.bsky.social for yesterday's stellar presentation of our work benchmarking LLMs' ability to simulate group-level human behavior: bsky.app/profile/tian...
tiancheng.bsky.social
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors, SRW Oral, Monday, July 28, 14:00-15:30
joachimbaumann.bsky.social
I'm at #ACL2025 this week: 📍 Find me at the FEVER workshop, *Thursday 11am* 📝 presenting "I Just Can't RAG Enough" - our ongoing work with @aurman21.bsky.social & @rer.bsky.social & Anikó Hannák, showing that RAG does not solve LLM fact-checking limitations!
joachimbaumann.bsky.social
Breaking my social media silence because this news is too good not to share! 🎉
Just joined @milanlp.bsky.social as a Postdoc, working with the amazing @dirkhovy.bsky.social on large language models and computational social science!
Reposted by Joachim Baumann
milanlp.bsky.social
🎉 The @milanlp.bsky.social lab is excited to present 15 papers and 1 tutorial at #ACL2025 & workshops! Grateful to all our amazing collaborators, see everyone in Vienna! 🚀