Itay Itzhak @ COLM 🍁
@itay-itzhak.bsky.social
NLProc, deep learning, and machine learning. Ph.D. student @ Technion and The Hebrew University. https://itay1itzhak.github.io/
itay-itzhak.bsky.social
Thrilled to be part of this work led by
@adisimhi.bsky.social !

ManagerBench reveals a critical problem:
✅ LLMs can recognize harm
❌ But often choose it anyway to meet goals
🤖 Or overcorrect and become ineffective
We need better balance!

A must-read for safety folks!
mtutek.bsky.social
🤔 What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?

🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
Reposted by Itay Itzhak @ COLM 🍁
boknilev.bsky.social
Traveling to #COLM2025 this week, and here's some work from our group and collaborators:
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
itay-itzhak.bsky.social
At #ACL2025 and not sure what to do next? GEM 💎² is the place to be for awesome talks on the future of LLM evaluation. Come hear @GabiStanovsky, @EliyaHabba, @LChoshen and others rethink what it means to actually evaluate LLMs beyond accuracy and vibes. Thursday @ Hall C!
itay-itzhak.bsky.social
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!

Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
itay-itzhak.bsky.social
🧠 Takeaway:
Cognitive biases are not introduced during instruction tuning.
They’re planted in pretraining and only surfaced by finetuning.
If we want fairer models, we need to look deeper into the pretraining pipeline.
itay-itzhak.bsky.social
🔄 Step 2: Cross-tuning.
We swap instruction datasets between models with different pretraining.
Result: Biases follow the pretrained model!

PCA clearly shows models group by pretraining base, not by instruction.
The bias “signature” stays intact, no matter the finetuning!
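(A minimal sketch of what such a PCA check could look like; all scores and model names below are illustrative placeholders, not numbers from the paper:)

```python
# Toy cross-tuning analysis: do models cluster by pretraining base or by
# instruction data? Rows = models, columns = hypothetical bias benchmarks.
import numpy as np
from sklearn.decomposition import PCA

bias_scores = np.array([
    [0.8, 0.1, 0.6],  # base A + instruction set 1
    [0.7, 0.2, 0.5],  # base A + instruction set 2 (swapped in)
    [0.2, 0.9, 0.1],  # base B + instruction set 1 (swapped in)
    [0.3, 0.8, 0.2],  # base B + instruction set 2
])
labels = ["A/inst1", "A/inst2", "B/inst1", "B/inst2"]

# Project to 2D; if biases follow pretraining, the A* points cluster
# apart from the B* points regardless of instruction data.
coords = PCA(n_components=2).fit_transform(bias_scores)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```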
itay-itzhak.bsky.social
🎲 Step 1: Training randomness.
We finetune the same model 3× with different seeds.
Result: bias scores vary somewhat across seeds, but behavior patterns stay stable (the variance is comparable to seed variance on MMLU).
✅ Aggregating across seeds reveals consistent trends.
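(For illustration, here is a toy version of the seed-aggregation step; `finetune` and `bias_score` are stubs standing in for real training and evaluation runs, not the paper's code:)

```python
import random
import statistics

def finetune(base_model: str, seed: int) -> str:
    """Stub: stands in for an instruction-tuning run with a given seed."""
    return f"{base_model}-seed{seed}"

def bias_score(model: str) -> float:
    """Stub bias benchmark: a stable trend plus small seed-dependent noise."""
    rng = random.Random(model)
    return 0.6 + 0.05 * rng.uniform(-1, 1)

# Finetune the same base model with three seeds and aggregate the scores.
scores = [bias_score(finetune("base-model", seed=s)) for s in (0, 1, 2)]
print(f"bias = {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```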
itay-itzhak.bsky.social
🧪 We introduce a two-step causal framework to disentangle the effects of:
- Pretraining
- Instruction tuning
- Training randomness

🍁 Bottom line: pretraining is the origin of bias. Finetuning? Just the messenger.
#CausalInference #TrustworthyAI #NLP
itay-itzhak.bsky.social
🚨New paper alert🚨

🧠
Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?

Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc
Reposted by Itay Itzhak @ COLM 🍁
fbarez.bsky.social
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
Reposted by Itay Itzhak @ COLM 🍁
sebgehr.bsky.social
Are you recovering from your @colmweb.org abstract submission? GEM has a non-archival track that allows you to submit a two-page abstract in parallel!

Our workshop deadline is soon, please consider submitting your evaluation paper!

You can find our call for papers at gem-benchmark.com/workshop
itay-itzhak.bsky.social
New paper alert!

Curious how small prompt tweaks impact LLM accuracy but don’t want to run endless inferences? We got you. Meet DOVE - a dataset built to uncover these sensitivities.

Use DOVE for your analysis or contribute samples - we're growing and welcome you aboard!
eliyahabba.bsky.social
Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!
Reposted by Itay Itzhak @ COLM 🍁
talhaklay.bsky.social
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
itay-itzhak.bsky.social
Super interesting! Have you tested how LAP handles more diverse paraphrasing? For example, do you think it would also work for code functions with similar roles?
Reposted by Itay Itzhak @ COLM 🍁
mtutek.bsky.social
🚨🚨 New preprint 🚨🚨

Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?

We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.

arxiv.org/abs/2502.14829
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. However, despite mu...
arxiv.org
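(To make the idea concrete, here is a toy sketch of the general "unlearn a step, re-check the answer" recipe; this is my illustration, not the paper's actual procedure, and GPT-2 plus the tiny arithmetic example are placeholders:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

cot_step = "48 divided by 6 is 8."          # the reasoning step to erase
prompt, answer = "Q: 48 / 6 = ? A:", " 8"

def answer_logprob() -> float:
    """Log-probability of the final answer token given the prompt."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    return torch.log_softmax(logits[0, -2], dim=-1)[ids[0, -1]].item()

before = answer_logprob()

# Gradient *ascent* on the CoT step's LM loss, nudging it out of the weights.
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
step_ids = tok(cot_step, return_tensors="pt").input_ids
for _ in range(10):
    loss = model(step_ids, labels=step_ids).loss
    opt.zero_grad()
    (-loss).backward()  # maximize the loss on the step to erase it
    opt.step()

# A large drop in answer probability would suggest the step mattered.
after = answer_logprob()
print(f"answer log-prob: {before:.3f} -> {after:.3f}")
```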
itay-itzhak.bsky.social
We usually blame hallucinations on uncertainty or missing knowledge. But what if I told you that LLMs hallucinate even when they *know* the correct answer - and they do it with *high certainty* 🤯?
Check out our new paper that challenges assumptions on AI trustworthiness! 🧵👇
adisimhi.bsky.social
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
Reposted by Itay Itzhak @ COLM 🍁
sebgehr.bsky.social
GEM is so back! Our workshop for Generation, Evaluation, and Metrics is coming to an ACL near you.

Evaluation in the world of GenAI is more important than ever, so please consider submitting your amazing work.

CfP can be found at gem-benchmark.com/workshop
itay-itzhak.bsky.social
Why not try the straightforward approach: label high-quality texts and train an LM to classify them? Of course, this should be done separately for different types of text - a great scientific paper ≠ a great novel.
(Similar to how Llama 3 pretraining used quality scores from Llama 2 and RoBERTa.)
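(Roughly, the sketch below shows what that could look like; the model choice, mock texts, and labels are illustrative assumptions, not a recipe from any of the papers mentioned:)

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny mock corpus: 1 = high quality, 0 = low. Real labels would be collected
# separately per text type (papers, novels, ...), as argued above.
data = Dataset.from_dict({
    "text": ["We prove the bound by induction on the tree depth...",
             "click here 4 FREE stuff!!!"],
    "label": [1, 0],
})

tok = AutoTokenizer.from_pretrained("roberta-base")
data = data.map(
    lambda batch: tok(batch["text"], truncation=True,
                      padding="max_length", max_length=64),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quality-clf", num_train_epochs=1),
    train_dataset=data,
)
trainer.train()  # the classifier's scores can then filter/re-weight pretraining data
```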