Itay Itzhak @ COLM 🍁
@itay-itzhak.bsky.social
NLProc, deep learning, and machine learning. Ph.D. student @ Technion and The Hebrew University. https://itay1itzhak.github.io/
itay-itzhak.bsky.social
Thrilled to be part of this work led by
@adisimhi.bsky.social !

ManagerBench reveals a critical problem:
✅ LLMs can recognize harm
❌ But often choose it anyway to meet goals
🤖 Or overcorrect and become ineffective
We need better balance!

A must-read for safety folks!
mtutek.bsky.social
🤔 What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?

🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
Reposted by Itay Itzhak @ COLM 🍁
boknilev.bsky.social
Traveling to #COLM2025 this week, and here's some work from our group and collaborators:
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
itay-itzhak.bsky.social
At #ACL2025 and not sure what to do next? GEM 💎² is the place to be for awesome talks on the future of LLM evaluation. Come hear @GabiStanovsky, @EliyaHabba, @LChoshen and others rethink what it means to actually evaluate LLMs beyond accuracy and vibes. Thursday @ Hall C!
itay-itzhak.bsky.social
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!

Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
itay-itzhak.bsky.social
🧠 Takeaway:
Cognitive biases are not introduced during instruction tuning.
They’re planted in pretraining and only surfaced by finetuning.
If we want fairer models, we need to look deeper into the pretraining pipeline.
itay-itzhak.bsky.social
🔄 Step 2: Cross-tuning.
We swap instruction datasets between models with different pretraining.
Result: Biases follow the pretrained model!

PCA clearly shows models group by pretraining base, not by instruction.
The bias “signature” stays intact, no matter the finetuning!
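(A minimal sketch of what such a PCA check could look like; all scores and model names below are illustrative placeholders, not numbers from the paper:)

```python
# Toy cross-tuning analysis: do models cluster by pretraining base or by
# instruction data? Rows = models, columns = hypothetical bias benchmarks.
import numpy as np
from sklearn.decomposition import PCA

bias_scores = np.array([
    [0.8, 0.1, 0.6],  # base A + instruction set 1
    [0.7, 0.2, 0.5],  # base A + instruction set 2 (swapped in)
    [0.2, 0.9, 0.1],  # base B + instruction set 1 (swapped in)
    [0.3, 0.8, 0.2],  # base B + instruction set 2
])
labels = ["A/inst1", "A/inst2", "B/inst1", "B/inst2"]

# Project to 2D; if biases follow pretraining, the A* points cluster
# apart from the B* points regardless of instruction data.
coords = PCA(n_components=2).fit_transform(bias_scores)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```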
itay-itzhak.bsky.social
🎲 Step 1: Training randomness.
We finetune the same model 3× with different seeds.
Result: bias scores vary somewhat across seeds, but behavior patterns stay stable (the variance is comparable to seed variance on MMLU).
✅ Aggregating across seeds reveals consistent trends.
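(For illustration, here is a toy version of the seed-aggregation step; `finetune` and `bias_score` are stubs standing in for real training and evaluation runs, not the paper's code:)

```python
import random
import statistics

def finetune(base_model: str, seed: int) -> str:
    """Stub: stands in for an instruction-tuning run with a given seed."""
    return f"{base_model}-seed{seed}"

def bias_score(model: str) -> float:
    """Stub bias benchmark: a stable trend plus small seed-dependent noise."""
    rng = random.Random(model)
    return 0.6 + 0.05 * rng.uniform(-1, 1)

# Finetune the same base model with three seeds and aggregate the scores.
scores = [bias_score(finetune("base-model", seed=s)) for s in (0, 1, 2)]
print(f"bias = {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```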
itay-itzhak.bsky.social
🧪 We introduce a two-step causal framework to disentangle the effects of:
- Pretraining
- Instruction tuning
- Training randomness

🍁 Bottom line: pretraining is the origin of bias. Finetuning? Just the messenger.
#CausalInference #TrustworthyAI #NLP
itay-itzhak.bsky.social
🚨New paper alert🚨

🧠
Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?

Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc
Reposted by Itay Itzhak @ COLM 🍁
fbarez.bsky.social
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
Reposted by Itay Itzhak @ COLM 🍁
sebgehr.bsky.social
Are you recovering from your @colmweb.org abstract submission? GEM has a non-archival track that allows you to submit a two-page abstract in parallel!

Our workshop deadline is soon, please consider submitting your evaluation paper!

You can find our call for papers at gem-benchmark.com/workshop
itay-itzhak.bsky.social
New paper alert!

Curious how small prompt tweaks impact LLM accuracy but don’t want to run endless inferences? We got you. Meet DOVE - a dataset built to uncover these sensitivities.

Use DOVE for your analysis or contribute samples - we're growing and welcome you aboard!
eliyahabba.bsky.social
Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!
Reposted by Itay Itzhak @ COLM 🍁
talhaklay.bsky.social
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
itay-itzhak.bsky.social
Super interesting! Have you tested how LAP handles more diverse paraphrasing? For example, do you think it would also work for code functions with similar roles?
Reposted by Itay Itzhak @ COLM 🍁
mtutek.bsky.social
🚨🚨 New preprint 🚨🚨

Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?

We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.

arxiv.org/abs/2502.14829
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. However, despite mu...
arxiv.org
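(To make the idea concrete, here is a toy sketch of the general "unlearn a step, re-check the answer" recipe; this is my illustration, not the paper's actual procedure, and GPT-2 plus the tiny arithmetic example are placeholders:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

cot_step = "48 divided by 6 is 8."          # the reasoning step to erase
prompt, answer = "Q: 48 / 6 = ? A:", " 8"

def answer_logprob() -> float:
    """Log-probability of the final answer token given the prompt."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    return torch.log_softmax(logits[0, -2], dim=-1)[ids[0, -1]].item()

before = answer_logprob()

# Gradient *ascent* on the CoT step's LM loss, nudging it out of the weights.
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
step_ids = tok(cot_step, return_tensors="pt").input_ids
for _ in range(10):
    loss = model(step_ids, labels=step_ids).loss
    opt.zero_grad()
    (-loss).backward()  # maximize the loss on the step to erase it
    opt.step()

# A large drop in answer probability would suggest the step mattered.
after = answer_logprob()
print(f"answer log-prob: {before:.3f} -> {after:.3f}")
```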
itay-itzhak.bsky.social
We usually blame hallucinations on uncertainty or missing knowledge. But what if I told you that LLMs hallucinate even when they *know* the correct answer - and they do it with *high certainty* 🤯?
Check out our new paper that challenges assumptions on AI trustworthiness! 🧵👇
adisimhi.bsky.social
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
Reposted by Itay Itzhak @ COLM 🍁
sebgehr.bsky.social
GEM is so back! Our workshop for Generation, Evaluation, and Metrics is coming to an ACL near you.

Evaluation in the world of GenAI is more important than ever, so please consider submitting your amazing work.

CfP can be found at gem-benchmark.com/workshop
itay-itzhak.bsky.social
Why not try the straightforward approach: label high-quality texts and train an LM to classify them? Of course, this should be done separately for different types of text - a great scientific paper ≠ a great novel.
(Similar to how Llama 3 pretraining used quality scores from Llama 2 and RoBERTa.)
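(Roughly, the sketch below shows what that could look like; the model choice, mock texts, and labels are illustrative assumptions, not a recipe from any of the papers mentioned:)

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny mock corpus: 1 = high quality, 0 = low. Real labels would be collected
# separately per text type (papers, novels, ...), as argued above.
data = Dataset.from_dict({
    "text": ["We prove the bound by induction on the tree depth...",
             "click here 4 FREE stuff!!!"],
    "label": [1, 0],
})

tok = AutoTokenizer.from_pretrained("roberta-base")
data = data.map(
    lambda batch: tok(batch["text"], truncation=True,
                      padding="max_length", max_length=64),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quality-clf", num_train_epochs=1),
    train_dataset=data,
)
trainer.train()  # the classifier's scores can then filter/re-weight pretraining data
```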