Clément Dumas
@butanium.bsky.social
530 followers 210 following 43 posts
Master student at ENS Paris-Saclay / aspiring AI safety researcher / improviser / Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West / MATS Winter 7.0 Scholar w/ neelnanda.bsky.social / https://butanium.github.io
Pinned
butanium.bsky.social
New paper w/@jkminder.bsky.social & @neelnanda.bsky.social
What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features!🧵
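Not from the thread itself, but here is a minimal sketch of what a BatchTopK crosscoder could look like in PyTorch, assuming a simple two-model (base/chat) setup; the architecture details, dimensions, and names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Illustrative sketch: shared encoder over (base, chat) activations,
    separate decoders per model, batch-level top-k sparsity."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        # encode the concatenated base/chat residual-stream activations
        self.encoder = nn.Linear(2 * d_model, n_latents)
        # one decoder per model: a latent whose chat-decoder column has much
        # larger norm than its base-decoder column is a candidate chat-only feature
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)

    def batch_topk(self, pre: torch.Tensor) -> torch.Tensor:
        # keep the k * batch_size largest activations across the whole batch,
        # i.e. the sparsity budget is shared over the batch, not per sample
        acts = pre.relu()
        threshold = acts.flatten().topk(self.k * pre.shape[0]).values.min()
        return acts * (acts >= threshold)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        pre = self.encoder(torch.cat([act_base, act_chat], dim=-1))
        latents = self.batch_topk(pre)
        return self.dec_base(latents), self.dec_chat(latents), latents
```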
butanium.bsky.social
For more info check the blogpost / Julian's thread
butanium.bsky.social
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the ft leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
butanium.bsky.social
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- Works even when comparing base → chat+ft!
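To make the diff-then-steer recipe concrete, here is a rough sketch assuming a Llama-style Hugging Face model; the model ids, layer index, scale, and module path are placeholders, not the actual setup from the post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, N_TOKENS, SCALE = 12, 5, 4.0                     # placeholder choices
tok = AutoTokenizer.from_pretrained("base-model-id")    # placeholder model ids
base = AutoModelForCausalLM.from_pretrained("base-model-id")
ft = AutoModelForCausalLM.from_pretrained("ft-model-id")

def first_token_acts(model, text):
    # residual-stream activations at LAYER for the first few tokens only
    ids = tok(text, return_tensors="pt").input_ids[:, :N_TOKENS]
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0]                                  # (N_TOKENS, d_model)

text = "Some random web text unrelated to the finetuning domain."
diff = (first_token_acts(ft, text) - first_token_acts(base, text)).mean(0)

def steer_hook(module, inputs, output):
    # add the mean activation difference back into the residual stream
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * diff
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = ft.model.layers[LAYER].register_forward_hook(steer_hook)  # Llama-style path
out = ft.generate(**tok("Tell me something interesting.", return_tensors="pt"),
                  max_new_tokens=40)
print(tok.decode(out[0]))
handle.remove()
```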
butanium.bsky.social
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵
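As a toy stand-in for the Patchscope readout mentioned above (continuing the `diff`, `ft`, and `tok` objects from the previous sketch, and again assuming a Llama-style model), a simpler logit-lens-style projection already hints at what the first-token diff encodes:

```python
with torch.no_grad():
    # project the mean activation difference through the finetuned model's
    # final norm and unembedding and inspect the tokens it points to
    readout = ft.lm_head(ft.model.norm(diff))
print(tok.convert_ids_to_tokens(readout.topk(10).indices.tolist()))
```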
jkminder.bsky.social
Can we interpret what happens in finetuning? Yes, at least for a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more
Reposted by Clément Dumas
jdp.extropian.net
GPT is being asked both to be one mind and to segment its understanding into many different minds; this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.
Reposted by Clément Dumas
butanium.bsky.social
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
butanium.bsky.social
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
Reposted by Clément Dumas
ndif-team.bsky.social
Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social

🔎 Demo on Colab: colab.research.google.com/github/ndif-...

📖 Read the full manuscript: arxiv.org/abs/2505.13898
Reposted by Clément Dumas
jkminder.bsky.social
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
butanium.bsky.social
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea
butanium.bsky.social
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
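As an illustration of that metric (not the evaluation code from the paper; the embedding model and strings are arbitrary), the scoring boils down to something like:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary embedding model

generated = ["A sweet baked dessert made from flour, sugar and eggs."]
reference = ["A baked food made from a mixture of flour, sugar, and other ingredients."]

# cosine similarity between each generated definition and its ground-truth definition
sims = util.cos_sim(embedder.encode(generated, convert_to_tensor=True),
                    embedder.encode(reference, convert_to_tensor=True))
print(f"mean cosine similarity: {sims.diag().mean():.3f}")
```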
butanium.bsky.social
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
butanium.bsky.social
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
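A rough sketch of the averaging step, with a placeholder model id, layer, and word list; the resulting vector is what gets patched into the translation or definition prompts described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID, LAYER = "some-llm-id", 15                      # placeholders
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

words = {"en": "cake", "fr": "gâteau", "de": "Kuchen", "es": "pastel"}

def last_token_state(text):
    # hidden state of the word's last token at the chosen layer
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# language-agnostic concept vector: average over translations of the same word
mean_repr = torch.stack([last_token_state(w) for w in words.values()]).mean(0)
print(mean_repr.shape)  # (d_model,)
```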
Reposted by Clément Dumas
girving.bsky.social
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, ..., y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B's probabilities, B wins if neither player can locate a flaw.
butanium.bsky.social
We'll be presenting at the #ICLR sparsity in LLMs workshop today (Sunday 27th) at 4:30 pm in Hall 4 #7!
butanium.bsky.social
Want to explore cool chat-related crosscoder latents?
With @jkminder.bsky.social, we made a demo that supports both loading our max activating examples AND running the crosscoder with your own prompt to collect the activations of specific latents!
Send us the cool latents you find! dub.sh/ccdm
Reposted by Clément Dumas
jkminder.bsky.social
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke detection latent.
butanium.bsky.social
Like Andy Arditi (andyrdt.com) & Cooper Leong (cooperleong00.github.io), we find that template tokens matter enormously!
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
butanium.bsky.social
Those latents can be used to steer the model's behavior, e.g. by inducing different types of refusal!