Clément Dumas
@butanium.bsky.social
530 followers 210 following 43 posts
Master student at ENS Paris-Saclay / aspiring AI safety researcher / improviser / Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West / MATS Winter 7.0 Scholar w/ neelnanda.bsky.social / https://butanium.github.io
Pinned
butanium.bsky.social
New paper w/@jkminder.bsky.social & @neelnanda.bsky.social
What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features!🧵
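Not from the thread itself, but here is a minimal sketch of what a BatchTopK crosscoder could look like in PyTorch, assuming a simple two-model (base/chat) setup; the architecture details, dimensions, and names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Illustrative sketch: shared encoder over (base, chat) activations,
    separate decoders per model, batch-level top-k sparsity."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        # encode the concatenated base/chat residual-stream activations
        self.encoder = nn.Linear(2 * d_model, n_latents)
        # one decoder per model: a latent whose chat-decoder column has much
        # larger norm than its base-decoder column is a candidate chat-only feature
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)

    def batch_topk(self, pre: torch.Tensor) -> torch.Tensor:
        # keep the k * batch_size largest activations across the whole batch,
        # i.e. the sparsity budget is shared over the batch, not per sample
        acts = pre.relu()
        threshold = acts.flatten().topk(self.k * pre.shape[0]).values.min()
        return acts * (acts >= threshold)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        pre = self.encoder(torch.cat([act_base, act_chat], dim=-1))
        latents = self.batch_topk(pre)
        return self.dec_base(latents), self.dec_chat(latents), latents
```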
butanium.bsky.social
For more info check the blogpost / Julian's thread
butanium.bsky.social
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the ft leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
butanium.bsky.social
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- Works even when comparing base → chat+ft!
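To make the diff-then-steer recipe concrete, here is a rough sketch assuming a Llama-style Hugging Face model; the model ids, layer index, scale, and module path are placeholders, not the actual setup from the post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, N_TOKENS, SCALE = 12, 5, 4.0                     # placeholder choices
tok = AutoTokenizer.from_pretrained("base-model-id")    # placeholder model ids
base = AutoModelForCausalLM.from_pretrained("base-model-id")
ft = AutoModelForCausalLM.from_pretrained("ft-model-id")

def first_token_acts(model, text):
    # residual-stream activations at LAYER for the first few tokens only
    ids = tok(text, return_tensors="pt").input_ids[:, :N_TOKENS]
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0]                                  # (N_TOKENS, d_model)

text = "Some random web text unrelated to the finetuning domain."
diff = (first_token_acts(ft, text) - first_token_acts(base, text)).mean(0)

def steer_hook(module, inputs, output):
    # add the mean activation difference back into the residual stream
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * diff
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = ft.model.layers[LAYER].register_forward_hook(steer_hook)  # Llama-style path
out = ft.generate(**tok("Tell me something interesting.", return_tensors="pt"),
                  max_new_tokens=40)
print(tok.decode(out[0]))
handle.remove()
```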
butanium.bsky.social
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵
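As a toy stand-in for the Patchscope readout mentioned above (continuing the `diff`, `ft`, and `tok` objects from the previous sketch, and again assuming a Llama-style model), a simpler logit-lens-style projection already hints at what the first-token diff encodes:

```python
with torch.no_grad():
    # project the mean activation difference through the finetuned model's
    # final norm and unembedding and inspect the tokens it points to
    readout = ft.lm_head(ft.model.norm(diff))
print(tok.convert_ids_to_tokens(readout.topk(10).indices.tolist()))
```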
jkminder.bsky.social
Can we interpret what happens in finetuning? Yes, at least for a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more
Reposted by Clément Dumas
jdp.extropian.net
GPT is being asked both to be one mind and to segment its understanding into many different minds; this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.
Reposted by Clément Dumas
butanium.bsky.social
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
butanium.bsky.social
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
Reposted by Clément Dumas
ndif-team.bsky.social
Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social

🔎 Demo on Colab: colab.research.google.com/github/ndif-...

📖 Read the full manuscript: arxiv.org/abs/2505.13898
Reposted by Clément Dumas
jkminder.bsky.social
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
butanium.bsky.social
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea
butanium.bsky.social
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
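As an illustration of that metric (not the evaluation code from the paper; the embedding model and strings are arbitrary), the scoring boils down to something like:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary embedding model

generated = ["A sweet baked dessert made from flour, sugar and eggs."]
reference = ["A baked food made from a mixture of flour, sugar, and other ingredients."]

# cosine similarity between each generated definition and its ground-truth definition
sims = util.cos_sim(embedder.encode(generated, convert_to_tensor=True),
                    embedder.encode(reference, convert_to_tensor=True))
print(f"mean cosine similarity: {sims.diag().mean():.3f}")
```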
butanium.bsky.social
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
butanium.bsky.social
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
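A rough sketch of the averaging step, with a placeholder model id, layer, and word list; the resulting vector is what gets patched into the translation or definition prompts described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID, LAYER = "some-llm-id", 15                      # placeholders
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

words = {"en": "cake", "fr": "gâteau", "de": "Kuchen", "es": "pastel"}

def last_token_state(text):
    # hidden state of the word's last token at the chosen layer
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# language-agnostic concept vector: average over translations of the same word
mean_repr = torch.stack([last_token_state(w) for w in words.values()]).mean(0)
print(mean_repr.shape)  # (d_model,)
```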
Reposted by Clément Dumas
girving.bsky.social
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, ..., y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B's probabilities, B wins if neither player can locate a flaw.
butanium.bsky.social
We'll be presenting at the #ICLR sparsity in LLMs workshop today (Sunday 27th) at 4:30 pm in Hall 4 #7!
butanium.bsky.social
Want to explore cool chat-related crosscoder latents?
With @jkminder.bsky.social, we made a demo that supports both loading our max activating examples AND running the crosscoder with your own prompt to collect the activations of specific latents!
Send us the cool latents you find! dub.sh/ccdm
Reposted by Clément Dumas
jkminder.bsky.social
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke detection latent.
butanium.bsky.social
Like Andy Arditi (andyrdt.com) & Cooper Leong (cooperleong00.github.io), we find that template tokens matter enormously!
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
butanium.bsky.social
Those latents can be used to steer the model's behavior, e.g. by inducing different types of refusal!