Julian Minder
@jkminder.bsky.social
1.1K followers 380 following 30 posts
PhD at EPFL with Robert West, Master at ETHZ. Mainly interested in Language Model Interpretability and Model Diffing. MATS 7.0 Winter 2025 Scholar w/ Neel Nanda. jkminder.ch
Posts Media Videos Starter Packs
jkminder.bsky.social
Takeaways: Narrow-finetuned “organisms” may poorly reflect broad, real-world training. They encode domain info that shows up even on unrelated inputs. (7/8)
jkminder.bsky.social
Ablations: Mixing unrelated chat data or shrinking the finetune set weakens the signal—consistent with overfitting. (6/8)
jkminder.bsky.social
Agent: The interpretability agent uses these signals to identify finetuning objectives with high accuracy, asking the model a few questions to refine its hypothesis and outperforming black-box baselines. (5/8)
jkminder.bsky.social
Result: Steering with these differences reproduces the finetuning data’s style and content on unrelated prompts. (4/8)
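Roughly, steering like this could look as follows. A hedged sketch assuming a LLaMA-style HuggingFace model; the model id, layer, scale, and the random placeholder vector (standing in for the actual per-position mean difference from the setup post, 2/8) are all illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")   # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
layer, scale = 12, 4.0                                      # illustrative choices
# Placeholder; in practice use the mean activation difference from the setup post (2/8).
steer_vec = torch.randn(ft.config.hidden_size)

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += scale * steer_vec.to(hidden.dtype)            # shift the residual stream
    return output

handle = ft.model.layers[layer].register_forward_hook(steer_hook)
ids = tok("Tell me about your weekend plans.", return_tensors="pt")
out = ft.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```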
jkminder.bsky.social
Result: Patchscope on these differences surfaces tokens tightly linked to the finetuning domain—no finetune data needed at inference. (3/8)
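A hedged illustration of the Patchscope step, again assuming a LLaMA-style HuggingFace model: patch a placeholder difference vector into the residual stream at the last position of an identity-style prompt and decode the top next-token predictions. Model id, layer, prompt, and the random vector are assumptions, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")     # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
layer = 12                                                    # illustrative layer
diff_vec = torch.randn(ft.config.hidden_size)                 # placeholder difference vector

prompt = "cat -> cat; 1135 -> 1135; hello -> hello; ? ->"     # identity-style prompt
ids = tok(prompt, return_tensors="pt")
patch_pos = ids.input_ids.shape[1] - 1                        # patch at the final position

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, patch_pos, :] = diff_vec.to(hidden.dtype)
    return output

handle = ft.model.layers[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = ft(**ids).logits[0, -1]
handle.remove()
# Tokens the patched hidden state points to
print([tok.decode(i) for i in logits.topk(10).indices.tolist()])
```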
jkminder.bsky.social
With @butanium.bsky.social @neelnanda.bsky.social Stewart Slocum
Setup: We compute per-position average activation differences between a base and finetuned model on unrelated text. We inspect these with Patchscope and by steering the finetuned model with the differences. (2/8)
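A minimal sketch of this setup, assuming two HuggingFace checkpoints that share a tokenizer; the model ids, layer, prompts, and number of positions are placeholders rather than the paper's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base_model_id")   # placeholder ids
ft = AutoModelForCausalLM.from_pretrained("ft_model_id")
tok = AutoTokenizer.from_pretrained("base_model_id")

prompts = ["The weather today is", "In 1869, the transcontinental railroad"]  # unrelated text
layer, n_pos = 12, 8                                            # illustrative choices

diff_sum = torch.zeros(n_pos, base.config.hidden_size)
count = torch.zeros(n_pos, 1)
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        h_base = base(**ids, output_hidden_states=True).hidden_states[layer][0]
        h_ft = ft(**ids, output_hidden_states=True).hidden_states[layer][0]
        n = min(h_base.shape[0], n_pos)
        diff_sum[:n] += (h_ft - h_base)[:n]
        count[:n] += 1

mean_diff = diff_sum / count.clamp(min=1)   # per-position average activation difference
```

mean_diff can then be inspected with Patchscope or used as a steering vector, as in the posts above.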
jkminder.bsky.social
Can we interpret what happens in finetuning? Yes, at least for a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
jkminder.bsky.social
Very cool initiative!
abosselut.bsky.social
The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That’s why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.
icepfl.bsky.social
EPFL, ETH Zurich & CSCS just released Apertus, Switzerland’s first fully open-source large language model.
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.

Read more: actu.epfl.ch/news/apertus...
jkminder.bsky.social
What does this mean? Causal Abstraction - while still a promising framework - must explicitly constrain representational structure or include the notion of generalization, since our proof hinges on the existence of an extremely overfitted function.
More detailed thread: bsky.app/profile/deni...
denissutter.bsky.social
1/9 In our new interpretability paper, we analyse causal abstraction—the framework behind Distributed Alignment Search—and show it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
jkminder.bsky.social
Our proofs show that, without assuming the linear representation hypothesis, any algorithm can be mapped onto any network. Experiments confirm this: e.g. by using highly non-linear representations we can map an Indirect-Object-Identification algorithm to randomly initialized language models.
jkminder.bsky.social
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social ) that the theory collapses without the linear representation hypothesis—a problem we call the non-linear representation dilemma.
Reposted by Julian Minder
tpimentel.bsky.social
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
jkminder.bsky.social
Could this have caught OpenAI's sycophantic model update? Maybe!

Post: lesswrong.com/posts/xmpauE...

Paper Thread: bsky.app/profile/buta...

Paper: arxiv.org/abs/2504.02922
jkminder.bsky.social
Our methods reveal interpretable features related to e.g. refusal detection, fake facts, or information about the model's identity. This highlights that model diffing is a promising research direction deserving more attention.
jkminder.bsky.social
By comparing base and chat models, we found that one of the main existing techniques (crosscoders) hallucinates differences due to how its sparsity is enforced. We fixed this and also found that just training an SAE on (chat - base) activations works surprisingly well.
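For intuition, a minimal sketch of the difference-SAE idea: a vanilla L1-sparse autoencoder trained directly on (chat - base) activations. Dimensions, the sparsity coefficient, and the random placeholder data are illustrative assumptions, not the post's actual training recipe:

```python
import torch
import torch.nn as nn

class DiffSAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))          # sparse latent features
        return self.dec(feats), feats

d_model, d_sae, l1_coef = 4096, 16384, 1e-3      # illustrative sizes
sae = DiffSAE(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Placeholder data; in practice these are (chat - base) residual-stream activations.
diff_acts = torch.randn(1024, d_model)
for step in range(100):
    batch = diff_acts[torch.randint(0, len(diff_acts), (64,))]
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```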
jkminder.bsky.social
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
jkminder.bsky.social
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We found many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke-detection latent.
Reposted by Julian Minder
drib.net
dribnet @drib.net · Dec 22
background: the technique here is "model-diffing" introduced by @anthropic.com just 8 weeks ago and quickly replicated by others. this includes an open source @hf.co model release by @butanium.bsky.social and @jkminder.bsky.social which I'm using. transformer-circuits.pub/2024/crossco...
Sparse Crosscoders for Cross-Layer Features and Model Diffing
transformer-circuits.pub
Reposted by Julian Minder
manoelhortaribeiro.bsky.social
New @acm-cscw.bsky.social paper, new content moderation paradigm.

Post Guidance lets moderators prevent rule-breaking by triggering interventions as users write posts!

We implemented PG on Reddit and tested it in a massive field experiment (n=97k). It became a feature!

arxiv.org/abs/2411.16814
jkminder.bsky.social
9/ We further examine the models that have been fine-tuned for this task and find evidence that the fine-tuning appears to learn how to set a knob that already exists in the model.
jkminder.bsky.social
8/ 4. Learn a subspace to control the behavior in the found layer, based on ideas from Distributed Alignment Search by Geiger et al.
We leveraged this recipe to find the 1D subspace in three different models: Llama-3.1, Mistral-v0.3, and Gemma-2.
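A hedged sketch of step 4, assuming a LLaMA-style HuggingFace model, an example layer, and a hypothetical paired dataset train_pairs of (original inputs, counterfactual inputs, target token id): we learn a single direction whose component, when swapped between the two runs, should control the model's answer, in the spirit of Distributed Alignment Search:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")     # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
ft.requires_grad_(False)                                      # only the direction is trained
layer = 12                                                    # the "found" layer (illustrative)

direction = torch.nn.Parameter(torch.randn(ft.config.hidden_size))
opt = torch.optim.Adam([direction], lr=1e-3)

def swap_along(h_orig, h_src, d):
    d = d / d.norm()
    # Replace the component of h_orig along d with the component of h_src along d.
    return h_orig - (h_orig @ d).unsqueeze(-1) * d + (h_src @ d) * d

# train_pairs is hypothetical: (original inputs, counterfactual inputs, expected answer token id)
for orig_ids, src_ids, target_tok in train_pairs:
    with torch.no_grad():
        h_src = ft(**src_ids, output_hidden_states=True).hidden_states[layer + 1][0, -1]

    def hook(module, inputs, output):
        hidden = output[0]
        patched_last = swap_along(hidden[:, -1, :], h_src, direction)
        patched = torch.cat([hidden[:, :-1, :], patched_last.unsqueeze(1)], dim=1)
        return (patched,) + output[1:]

    handle = ft.model.layers[layer].register_forward_hook(hook)
    logits = ft(**orig_ids).logits[0, -1]
    handle.remove()
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_tok]))
    opt.zero_grad(); loss.backward(); opt.step()
```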