Yonatan Belinkov ✈️ COLM2025
@boknilev.bsky.social
110 followers 390 following 19 posts
Assistant professor of computer science at Technion; visiting scholar at @KempnerInst 2025-2026 https://belinkov.com/
Reposted by Yonatan Belinkov ✈️ COLM2025
mtutek.bsky.social
🤔What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?

🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
boknilev.bsky.social
Reach out if you'd like to discuss anything related to model interpretability and controllability, robustness, multi-agent communication, biological LMs, etc.
Also happy to talk about PhD and Post-doc opportunities!
boknilev.bsky.social
In the #Interplay25 workshop, Friday ~11:30, I'll present on measuring *parametric* CoT faithfulness on behalf of
@mtutek.bsky.social, who couldn't travel:
bsky.app/profile/mtut...

Later that day we'll have a poster on predicting the success of model editing, by Yanay Soker, who also couldn't travel
boknilev.bsky.social
@zorikgekhman.bsky.social
will present a poster on hidden factual knowledge in LMs on Wednesday
bsky.app/profile/zori...
zorikgekhman.bsky.social
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”?

In our new paper, we clearly define this concept and design controlled experiments to test it.
1/🧵
boknilev.bsky.social
@itay-itzhak.bsky.social
presenting this morning: a spotlight talk and poster on the origin of cognitive biases in LLMs
bsky.app/profile/itay...
itay-itzhak.bsky.social
🚨New paper alert🚨

🧠
Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?

Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc
boknilev.bsky.social
Traveling to #COLM2025 this week, and here's some work from our group and collaborators:
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
Reposted by Yonatan Belinkov ✈️ COLM2025
amuuueller.bsky.social
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
boknilev.bsky.social
Opportunities to join my group in fall 2026:
* PhD applications direct or via ELLIS @ellis.eu (ellis.eu/news/ellis-p...)
* Post-doc applications direct or via Azrieli (azrielifoundation.org/fellows/inte...) or Zuckerman (zuckermanstem.org/ourprograms/...)
boknilev.bsky.social
Excited to join @KempnerInst this year!
Get in touch if you're in the Boston area and want to chat about anything related to AI interpretability, robustness, interventions, safety, multi-modality, protein/DNA LMs, new architectures, multi-agent communication, or anything else you're excited about!
Reposted by Yonatan Belinkov ✈️ COLM2025
mtutek.bsky.social
Thrilled that FUR was accepted to @emnlpmeeting.bsky.social Main🎉

In case you can't wait that long to hear about it in person, it will also be presented as an oral at @interplay-workshop.bsky.social @colmweb.org 🥳

FUR is a parametric test assessing whether CoTs faithfully verbalize latent reasoning.
boknilev.bsky.social
BlackboxNLP is the workshop on interpreting and analyzing NLP models (including LLMs, VLMs, etc.). We accept full (archival) papers and extended abstracts.

The workshop is highly attended and offers great exposure for finished work, or a chance to get feedback on work in progress.

#emnlp2025 in Suzhou, China!
blackboxnlp.bsky.social
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.

📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
boknilev.bsky.social
Join our Discord for discussions and a bunch of simple submission ideas you can try!
discord.gg/n5uwjQcxPR

Participants will have the option to write a system description paper that gets published.
Reposted by Yonatan Belinkov ✈️ COLM2025
blackboxnlp.bsky.social
Have you started working on your submission for the MIB shared task yet? Tell us what you’re exploring!

New featurization methods?
Circuit pruning?
Better feature attribution?

We'd love to hear about it 👇
Reposted by Yonatan Belinkov ✈️ COLM2025
blackboxnlp.bsky.social
Working on feature attribution, circuit discovery, feature alignment, or sparse coding?
Consider submitting your work to the MIB Shared Task, part of this year’s #BlackboxNLP

We welcome submissions of both existing methods and new or experimental POCs!
Reposted by Yonatan Belinkov ✈️ COLM2025
danaarad.bsky.social
VLMs answer questions about text better than the same questions about images - but why? And how can we fix it?

In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective and use our findings to close a third of it! 🧵
Reposted by Yonatan Belinkov ✈️ COLM2025
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
Reposted by Yonatan Belinkov ✈️ COLM2025
tomerashuach.bsky.social
🚨New paper at #ACL2025 Findings!
REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space.
LMs memorize and leak sensitive data from their training: emails, SSNs, URLs.
We propose a surgical method to unlearn it.
🧵👇w/ @boknilev.bsky.social @mtutek.bsky.social
1/8
boknilev.bsky.social
Interested in mechanistic interpretability and care about evaluation? Please consider submitting to our shared task at #BlackboxNLP this year!
blackboxnlp.bsky.social
BlackboxNLP, the leading workshop on interpretability and analysis of language models, will be co-located with EMNLP 2025 in Suzhou this November! 📆

This edition will feature a new shared task on circuits/causal variable localization in LMs, details here: blackboxnlp.github.io/2025/task
Reposted by Yonatan Belinkov ✈️ COLM2025
boknilev.bsky.social
I think a scalable open-source implementation would have many uses! Say I can't run it on all the pretraining data because of cost, so I run it on a subset and get influential examples. What would that tell me about what I'm missing?
boknilev.bsky.social
Looks great! What would it take to run this on another model and dataset?
boknilev.bsky.social
This has been a huge team effort with many talented contributors. Very thankful for everyone’s contributions!

See the list here:
bsky.app/profile/amuu...
amuuueller.bsky.social
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...