Aaron Mueller
@amuuueller.bsky.social
2.3K followers · 330 following · 36 posts
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
amuuueller.bsky.social
Thanks again to the co-authors! Such a wide survey required a lot of perspectives. @jannikbrinkmann.bsky.social Millicent Li, Samuel Marks, @koyena.bsky.social @nikhil07prakash.bsky.social @canrager.bsky.social (1/2)
amuuueller.bsky.social
We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.
amuuueller.bsky.social
One of the bigger changes was establishing criteria for success in interpretability. What units of analysis should you use if you know what you’re looking for? If you *don’t* know what you’re looking for?
amuuueller.bsky.social
What's the right unit of analysis for understanding LLM internals? We explore this question in our mech interp survey (a major update of our 2024 manuscript).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
Reposted by Aaron Mueller
lampinen.bsky.social
In neuroscience, we often try to understand systems by analyzing their representations — using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:
Image: What do representations tell us about a system? A mouse with a scope showing a vector of activity patterns, alongside a neural network with a vector of unit activity patterns.
Image: Common analyses of neural representations. Encoding models: relating activity to task features (an arrow from a stimulus trace to a neuron and spike train). Comparing models via neural predictivity: comparing two neural networks by their R^2 fit to mouse brain activity. RSA: assessing brain-brain or model-brain correspondence using representational dissimilarity matrices.
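For reference, this is roughly what RSA computes: build a representational dissimilarity matrix (RDM) over stimuli for each system, then correlate the two RDMs. A minimal sketch with illustrative shapes and random data, not code from the commentary:

```python
# Minimal RSA sketch (illustrative only; random data stands in for real recordings).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_stimuli, n_units) activity matrix -> condensed RDM (1 - Pearson r per stimulus pair)."""
    return pdist(responses, metric="correlation")

rng = np.random.default_rng(0)
brain_acts = rng.normal(size=(50, 200))   # e.g., recorded neural activity for 50 stimuli
model_acts = rng.normal(size=(50, 512))   # e.g., network layer activations for the same stimuli

# RSA score: rank correlation between the two RDMs (higher = more similar representational geometry).
rho, _ = spearmanr(rdm(brain_acts), rdm(model_acts))
print(f"RSA (Spearman rho between RDMs): {rho:.3f}")
```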
amuuueller.bsky.social
If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.

We're planning to keep this a living benchmark; come by and share your ideas/hot takes!
amuuueller.bsky.social
We still have a lot to learn about editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.
amuuueller.bsky.social
By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods—and at some locations, we outperform them!
amuuueller.bsky.social
We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
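The preprint has its own fast sorting procedure; purely as an illustration of the idea (my sketch, not the paper's method), one cheap heuristic would be to project each SAE decoder direction through the unembedding matrix and call a feature output-like if it concentrates probability mass on a few tokens:

```python
# Hypothetical sketch of sorting SAE features into "output-like" vs "input-like"
# by how sharply their decoder directions promote specific next tokens.
# This illustrates the idea only; it is not the preprint's actual procedure.
import torch

def sort_features(W_dec, W_U, top_mass_threshold=0.25, k=10):
    """
    W_dec: (n_features, d_model) SAE decoder directions
    W_U:   (d_model, vocab_size) unembedding matrix
    Returns a boolean mask: True = output-like feature.
    """
    logits = W_dec @ W_U                                 # (n_features, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    top_mass = probs.topk(k, dim=-1).values.sum(-1)      # mass on the k most-promoted tokens
    return top_mass > top_mass_threshold                 # concentrated mass -> output-like

# Usage (assuming an SAE and model with these attributes):
# output_like = sort_features(sae.W_dec, model.W_U)
```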
amuuueller.bsky.social
SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
Reposted by Aaron Mueller
wegotlieb.bsky.social
Couldn’t be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social
amuuueller.bsky.social
... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!
amuuueller.bsky.social
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...
amuuueller.bsky.social
We release many public resources, including:

🌐 Website: mib-bench.github.io
📄 Data: huggingface.co/collections/...
💻 Code: github.com/aaronmueller...
📊 Leaderboard: Coming very soon!
amuuueller.bsky.social
These results highlight that there has been real progress in the field! We also recovered known findings, like that integrated gradients improves attribution quality. This is a sanity check verifying that our benchmark is capturing something real.
amuuueller.bsky.social
We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.
Table of results for the causal variable localization track.
amuuueller.bsky.social
This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.
Visual intuition underlying the interchange intervention accuracy (IIA), the main faithfulness metric for this track.
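For readers new to the metric, a schematic of the IIA loop looks like this; `featurize`, `unfeaturize`, `run_with_patch`, and `expected_output` stand in for the user-supplied pieces and are not the benchmark's actual API:

```python
# Schematic of the interchange intervention accuracy (IIA) loop.
# All helper names are placeholders for user-supplied components.
def iia(model, pairs, featurize, unfeaturize, location, target_dims, expected_output):
    correct = 0
    for base, counterfactual in pairs:
        # Featurize activations at the hypothesized location for both inputs.
        z_base = featurize(model.activations(base, location))
        z_cf = featurize(model.activations(counterfactual, location))

        # Swap only the target causal variable's features into the base run.
        z_patched = z_base.clone()
        z_patched[target_dims] = z_cf[target_dims]

        # Does the patched model behave as the hypothesized causal model predicts?
        output = model.run_with_patch(base, location, unfeaturize(z_patched))
        correct += int(output == expected_output(base, counterfactual))
    return correct / len(pairs)
```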
amuuueller.bsky.social
The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?
Overview of the causal variable localization track. Users provide a trained featurizer and location at which the causal variable is hypothesized to exist. The faithfulness of the intervention is measured; this is the final score.
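To make "featurize and intervene selectively" concrete, here is a toy DAS-style sketch: rotate the activation into a basis (random here, purely for illustration), swap a small subspace between two runs, and rotate back:

```python
# Toy sketch of a DAS-style featurizer. In real DAS the rotation is learned;
# here it is random and the sizes are illustrative.
import torch

d_model, k = 768, 16                                       # hidden size and subspace size
rot = torch.linalg.qr(torch.randn(d_model, d_model)).Q     # orthonormal "featurizer"

def intervene(h_base, h_source):
    z_base, z_source = h_base @ rot, h_source @ rot        # featurize both activations
    z_base[..., :k] = z_source[..., :k]                    # overwrite only the hypothesized causal variable
    return z_base @ rot.T                                  # map back to model space

h_base, h_source = torch.randn(d_model), torch.randn(d_model)
h_patched = intervene(h_base, h_source)                    # would be spliced back into the forward pass
```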
amuuueller.bsky.social
We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.
Table summarizing the results from the circuit localization track.
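For context, attribution patching approximates the effect of patching an activation linearly, as Δmetric ≈ (a_patch − a_clean) · ∂metric/∂a; the integrated-gradients variant averages that gradient along the interpolation between the two activations. A rough sketch, with a placeholder `metric_fn` rather than the benchmark's implementation:

```python
# Rough sketch of attribution patching with integrated gradients (IG) for one
# activation site. `metric_fn(acts)` is a placeholder that reruns the model with
# `acts` spliced in at that site and returns a scalar metric.
import torch

def ig_attribution(metric_fn, a_clean, a_patch, steps=8):
    total_grad = torch.zeros_like(a_clean)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Evaluate the gradient at an interpolated activation.
        a = (a_clean + alpha * (a_patch - a_clean)).detach().requires_grad_(True)
        metric_fn(a).backward()
        total_grad += a.grad
    avg_grad = total_grad / steps
    # Linear estimate of how much patching this site would change the metric.
    return ((a_patch - a_clean) * avg_grad).sum()
```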
amuuueller.bsky.social
Thus, we split 𝘧 into two metrics: the integrated 𝗰𝗶𝗿𝗰𝘂𝗶𝘁 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗿𝗮𝘁𝗶𝗼 (CPR), and the integrated 𝗰𝗶𝗿𝗰𝘂𝗶𝘁–𝗺𝗼𝗱𝗲𝗹 𝗱𝗶𝘀𝘁𝗮𝗻𝗰𝗲 (CMD). Both involve integrating 𝘧 across many circuit sizes. This implicitly captures 𝗳𝗮𝗶𝘁𝗵𝗳𝘂𝗹𝗻𝗲𝘀𝘀 and 𝗺𝗶𝗻𝗶𝗺𝗮𝗹𝗶𝘁𝘆 at the same time!
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).
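Given a faithfulness curve f(k) over circuit sizes k, CPR is the area under the curve and CMD the area between the curve and 1. A toy computation with made-up curve values; normalizing by the swept range is my assumption, not necessarily the benchmark's convention:

```python
# Illustrative computation of CPR and CMD from a faithfulness curve evaluated
# at several circuit sizes (fractions of the full model). Values are toy data.
import numpy as np

sizes = np.array([0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0])       # circuit sizes swept
faithfulness = np.array([0.1, 0.3, 0.6, 0.8, 0.9, 0.97, 1.0])  # f at each size

span = sizes[-1] - sizes[0]                                    # normalization (assumed)
cpr = np.trapezoid(faithfulness, sizes) / span        # area under the curve (higher is better)
cmd = np.trapezoid(1.0 - faithfulness, sizes) / span  # area between curve and 1 (lower is better)
print(f"CPR = {cpr:.3f}, CMD = {cmd:.3f}")            # use np.trapz on NumPy < 2.0
```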