Julian Minder
@jkminder.bsky.social
1.1K followers 380 following 30 posts
PhD at EPFL with Robert West, Master at ETHZ. Mainly interested in Language Model Interpretability and Model Diffing. MATS 7.0 Winter 2025 Scholar w/ Neel Nanda. jkminder.ch
Posts Media Videos Starter Packs
jkminder.bsky.social
Takeaways: Narrow-finetuned “organisms” may poorly reflect broad, real-world training. They encode domain info that shows up even on unrelated inputs. (7/8)
jkminder.bsky.social
Ablations: Mixing unrelated chat data or shrinking the finetune set weakens the signal—consistent with overfitting. (6/8)
jkminder.bsky.social
Agent: The interpretability agent uses these signals to identify finetuning objectives with high accuracy, asking the model a few questions to refine its hypothesis and outperforming black-box baselines. (5/8)
jkminder.bsky.social
Result: Steering with these differences reproduces the finetuning data’s style and content on unrelated prompts. (4/8)
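Roughly, steering like this could look as follows. A hedged sketch assuming a LLaMA-style HuggingFace model; the model id, layer, scale, and the random placeholder vector (standing in for the actual per-position mean difference from the setup post, 2/8) are all illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")   # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
layer, scale = 12, 4.0                                      # illustrative choices
# Placeholder; in practice use the mean activation difference from the setup post (2/8).
steer_vec = torch.randn(ft.config.hidden_size)

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += scale * steer_vec.to(hidden.dtype)            # shift the residual stream
    return output

handle = ft.model.layers[layer].register_forward_hook(steer_hook)
ids = tok("Tell me about your weekend plans.", return_tensors="pt")
out = ft.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```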
jkminder.bsky.social
Result: Patchscope on these differences surfaces tokens tightly linked to the finetuning domain—no finetune data needed at inference. (3/8)
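A hedged illustration of the Patchscope step, again assuming a LLaMA-style HuggingFace model: patch a placeholder difference vector into the residual stream at the last position of an identity-style prompt and decode the top next-token predictions. Model id, layer, prompt, and the random vector are assumptions, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")     # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
layer = 12                                                    # illustrative layer
diff_vec = torch.randn(ft.config.hidden_size)                 # placeholder difference vector

prompt = "cat -> cat; 1135 -> 1135; hello -> hello; ? ->"     # identity-style prompt
ids = tok(prompt, return_tensors="pt")
patch_pos = ids.input_ids.shape[1] - 1                        # patch at the final position

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, patch_pos, :] = diff_vec.to(hidden.dtype)
    return output

handle = ft.model.layers[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = ft(**ids).logits[0, -1]
handle.remove()
# Tokens the patched hidden state points to
print([tok.decode(i) for i in logits.topk(10).indices.tolist()])
```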
jkminder.bsky.social
With @butanium.bsky.social @neelnanda.bsky.social Stewart Slocum
Setup: We compute per-position average activation differences between a base and finetuned model on unrelated text. We inspect these with Patchscope and by steering the finetuned model with the differences. (2/8)
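A minimal sketch of this setup, assuming two HuggingFace checkpoints that share a tokenizer; the model ids, layer, prompts, and number of positions are placeholders rather than the paper's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base_model_id")   # placeholder ids
ft = AutoModelForCausalLM.from_pretrained("ft_model_id")
tok = AutoTokenizer.from_pretrained("base_model_id")

prompts = ["The weather today is", "In 1869, the transcontinental railroad"]  # unrelated text
layer, n_pos = 12, 8                                            # illustrative choices

diff_sum = torch.zeros(n_pos, base.config.hidden_size)
count = torch.zeros(n_pos, 1)
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        h_base = base(**ids, output_hidden_states=True).hidden_states[layer][0]
        h_ft = ft(**ids, output_hidden_states=True).hidden_states[layer][0]
        n = min(h_base.shape[0], n_pos)
        diff_sum[:n] += (h_ft - h_base)[:n]
        count[:n] += 1

mean_diff = diff_sum / count.clamp(min=1)   # per-position average activation difference
```

mean_diff can then be inspected with Patchscope or used as a steering vector, as in the posts above.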
jkminder.bsky.social
Can we interpret what happens in finetuning? Yes, at least for a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
jkminder.bsky.social
Very cool initiative!
abosselut.bsky.social
The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That’s why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.
icepfl.bsky.social
EPFL, ETH Zurich & CSCS just released Apertus, Switzerland’s first fully open-source large language model.
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.

Read more: actu.epfl.ch/news/apertus...
jkminder.bsky.social
What does this mean? Causal Abstraction - while still a promising framework - must explicitly constrain representational structure or include the notion of generalization, since our proof hinges on the existence of an extremely overfitted function.
More detailed thread: bsky.app/profile/deni...
denissutter.bsky.social
1/9 In our new interpretability paper, we analyse causal abstraction—the framework behind Distributed Alignment Search—and show it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
jkminder.bsky.social
Our proofs show that, without assuming the linear representation hypothesis, any algorithm can be mapped onto any network. Experiments confirm this: e.g. by using highly non-linear representations we can map an Indirect-Object-Identification algorithm to randomly initialized language models.
jkminder.bsky.social
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social ) that the theory collapses without the linear representation hypothesis—a problem we call the non-linear representation dilemma.
Reposted by Julian Minder
tpimentel.bsky.social
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
jkminder.bsky.social
Could this have caught OpenAI's sycophantic model update? Maybe!

Post: lesswrong.com/posts/xmpauE...

Paper Thread: bsky.app/profile/buta...

Paper: arxiv.org/abs/2504.02922
jkminder.bsky.social
Our methods reveal interpretable features related to e.g. refusal detection, fake facts, or information about the model's identity. This highlights that model diffing is a promising research direction deserving more attention.
jkminder.bsky.social
By comparing base and chat models, we found that one of the main existing techniques (crosscoders) hallucinates differences due to how its sparsity is enforced. We fixed this and also found that just training an SAE on (chat - base) activations works surprisingly well.
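For intuition, a minimal sketch of the difference-SAE idea: a vanilla L1-sparse autoencoder trained directly on (chat - base) activations. Dimensions, the sparsity coefficient, and the random placeholder data are illustrative assumptions, not the post's actual training recipe:

```python
import torch
import torch.nn as nn

class DiffSAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))          # sparse latent features
        return self.dec(feats), feats

d_model, d_sae, l1_coef = 4096, 16384, 1e-3      # illustrative sizes
sae = DiffSAE(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Placeholder data; in practice these are (chat - base) residual-stream activations.
diff_acts = torch.randn(1024, d_model)
for step in range(100):
    batch = diff_acts[torch.randint(0, len(diff_acts), (64,))]
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```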
jkminder.bsky.social
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
jkminder.bsky.social
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We found many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke-detection latent.
Reposted by Julian Minder
drib.net
dribnet @drib.net · Dec 22
background: the technique here is "model-diffing" introduced by @anthropic.com just 8 weeks ago and quickly replicated by others. this includes an open source @hf.co model release by @butanium.bsky.social and @jkminder.bsky.social which I'm using. transformer-circuits.pub/2024/crossco...
Sparse Crosscoders for Cross-Layer Features and Model Diffing
transformer-circuits.pub
Reposted by Julian Minder
manoelhortaribeiro.bsky.social
New @acm-cscw.bsky.social paper, new content moderation paradigm.

Post Guidance lets moderators prevent rule-breaking by triggering interventions as users write posts!

We implemented PG on Reddit and tested it in a massive field experiment (n=97k). It became a feature!

arxiv.org/abs/2411.16814
jkminder.bsky.social
9/ We further examine the models that have been fine-tuned for this task and find evidence that the fine-tuning appears to learn how to set a knob that already exists in the model.
jkminder.bsky.social
8/ 4. Learn a subspace to control the behavior in the found layer, based on ideas from Distributed Alignment Search by Geiger et al.
We leveraged this recipe to find the 1D subspace in three different models: Llama-3.1, Mistral-v0.3, and Gemma-2.
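A hedged sketch of step 4, assuming a LLaMA-style HuggingFace model, an example layer, and a hypothetical paired dataset train_pairs of (original inputs, counterfactual inputs, target token id): we learn a single direction whose component, when swapped between the two runs, should control the model's answer, in the spirit of Distributed Alignment Search:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ft = AutoModelForCausalLM.from_pretrained("ft_model_id")     # placeholder id
tok = AutoTokenizer.from_pretrained("ft_model_id")
ft.requires_grad_(False)                                      # only the direction is trained
layer = 12                                                    # the "found" layer (illustrative)

direction = torch.nn.Parameter(torch.randn(ft.config.hidden_size))
opt = torch.optim.Adam([direction], lr=1e-3)

def swap_along(h_orig, h_src, d):
    d = d / d.norm()
    # Replace the component of h_orig along d with the component of h_src along d.
    return h_orig - (h_orig @ d).unsqueeze(-1) * d + (h_src @ d) * d

# train_pairs is hypothetical: (original inputs, counterfactual inputs, expected answer token id)
for orig_ids, src_ids, target_tok in train_pairs:
    with torch.no_grad():
        h_src = ft(**src_ids, output_hidden_states=True).hidden_states[layer + 1][0, -1]

    def hook(module, inputs, output):
        hidden = output[0]
        patched_last = swap_along(hidden[:, -1, :], h_src, direction)
        patched = torch.cat([hidden[:, :-1, :], patched_last.unsqueeze(1)], dim=1)
        return (patched,) + output[1:]

    handle = ft.model.layers[layer].register_forward_hook(hook)
    logits = ft(**orig_ids).logits[0, -1]
    handle.remove()
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_tok]))
    opt.zero_grad(); loss.backward(); opt.step()
```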