Dana Arad
@danaarad.bsky.social
58 followers 230 following 23 posts
NLP Researcher | CS PhD Candidate @ Technion
Pinned
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
danaarad.bsky.social
Now accepted to EMNLP Main Conference!
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
danaarad.bsky.social
Submit your work to #BlackboxNLP 2025!
blackboxnlp.bsky.social
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.

📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
danaarad.bsky.social
Excited to spend the rest of the summer visiting @davidbau.bsky.social's lab at Northeastern! If you’re in the area and want to chat about interpretability, let me know ☕️
Reposted by Dana Arad
itay-itzhak.bsky.social
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!

Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
danaarad.bsky.social
10 days to go! Still time to run your method and submit!
blackboxnlp.bsky.social
Just 10 days to go until the results submission deadline for the MIB Shared Task at #BlackboxNLP!

If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!

Join us on Discord: discord.gg/n5uwjQcxPR
danaarad.bsky.social
Three weeks is plenty of time to submit your method!
blackboxnlp.bsky.social
⏳ Three weeks left! Submit your work to the MIB Shared Task at #BlackboxNLP, co-located with @emnlpmeeting.bsky.social

Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
danaarad.bsky.social
What are you working on for the MIB shared task?

Check out the full task description here: blackboxnlp.github.io/2025/task/
Reposted by Dana Arad
blackboxnlp.bsky.social
New to mechanistic interpretability?
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation

Join the discord server for ideas and discussions: discord.gg/n5uwjQcxPR
danaarad.bsky.social
In this work we take a step towards understanding and mitigating the vision-language performance gap, but there's still more to explore!

This was an awesome collaboration w/ Yossi Gandelsman, @boknilev.bsky.social, led by Yaniv Nikankin 🤩

Paper and code: technion-cs-nlp.github.io/vlm-circuits...
Same Task, Different Circuits – Project Page
technion-cs-nlp.github.io
danaarad.bsky.social
By simply patching visual data tokens from later layers back into earlier ones, we improve performance by 4.6% on average - closing a third of the gap!
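A minimal sketch of what this back-patching intervention could look like using PyTorch forward hooks; the layer-list path, the tuple output layout, and the two-pass setup are illustrative assumptions, not the paper's exact implementation:

import torch

def back_patch(model, inputs, data_positions, src_layer, dst_layer):
    # Cache the hidden states of the visual data tokens at a later layer
    # (src_layer), then on a second pass overwrite the earlier layer's
    # (dst_layer) output at those positions with the cached states.
    cache = {}

    def save_hook(module, args, output):
        cache["h"] = output[0].detach()  # output[0]: residual-stream states (assumed HF-style tuple)

    def patch_hook(module, args, output):
        h = output[0].clone()
        h[:, data_positions] = cache["h"][:, data_positions]
        return (h,) + output[1:]

    layers = model.model.layers  # assumed HF-style decoder layer list
    handle = layers[src_layer].register_forward_hook(save_hook)
    with torch.no_grad():
        model(**inputs)  # pass 1: cache late-layer states
    handle.remove()

    handle = layers[dst_layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        out = model(**inputs)  # pass 2: early layer patched with late states
    handle.remove()
    return out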
danaarad.bsky.social
4. Zooming in on data positions, we show that visual representations gradually align with their textual analogs across model layers (also shown by @zhaofeng_wu et al.). We hypothesize this alignment may happen too late for the model to fully process the information, and fix it with back-patching.
danaarad.bsky.social
3. Data sub-circuits, however, are modality-specific; swapping them significantly degrades performance. Critically, this highlights that differences in data processing are a key factor in the performance gap.
danaarad.bsky.social
2. Structure is only half the story: different circuits can still implement similar logic. We swap sub-circuits between modalities to measure cross-modal faithfulness.
Turns out, query and generation sub-circuits are functionally equivalent, retaining faithfulness when swapped!
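As a rough illustration of the swap, a circuit can be treated as a mapping from position group to a set of components (this representation, the component ids, and the faithfulness helper are assumptions for exposition, not the paper's code):

def swap_subcircuit(circ_a, circ_b, group):
    # Build a hybrid circuit: circ_a everywhere, except the chosen
    # position group, which is taken from circ_b.
    hybrid = {g: set(comps) for g, comps in circ_a.items()}
    hybrid[group] = set(circ_b[group])
    return hybrid

# Toy circuits; component ids like "L8.H0" (layer 8, head 0) are made up.
image_circuit = {"data": {"L2.H1", "L5.H3"}, "query": {"L8.H0"}, "generation": {"L20.H5"}}
text_circuit = {"data": {"L1.H4", "L3.H2"}, "query": {"L8.H0"}, "generation": {"L20.H5"}}

# Run the image task with the text circuit's query sub-circuit; a
# hypothetical faithfulness(model, hybrid, task) helper would then
# compare the hybrid circuit's behavior to the full model's.
hybrid = swap_subcircuit(image_circuit, text_circuit, "query")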
danaarad.bsky.social
1. Circuits for the same task are mostly structurally disjoint, with an average of only 18% of components shared between modalities!
The overlap is extremely low in data and query positions, and moderate only in the generation (last) position.
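For intuition, overlap between two circuits could be measured as intersection-over-union over their component sets (an assumed metric; the paper's exact definition may differ):

def overlap(circ_a, circ_b):
    # Fraction of components shared between two circuits (IoU).
    a, b = set(circ_a), set(circ_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# e.g. overlap({"L2.H1", "L8.H0"}, {"L8.H0", "L3.H2"}) == 1/3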
danaarad.bsky.social
We identify circuits (task-specific computational sub-graphs composed of attention heads and MLP neurons) used by VLMs to solve both variants.
What did we find? >>
danaarad.bsky.social
Consider object counting: we can ask a VLM “how many books are there?” given either an image or a sequence of words. Like Kaduri et al., we consider three types of positions within the input - data (image or word sequence), query ("how many..."), and generation (last token).
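A toy illustration of how an input might be partitioned into these three position groups (token indices are made up for exposition):

# Text variant of the counting task: "<word sequence> how many books are there?"
positions = {
    "data": list(range(0, 12)),    # the image tokens or word sequence
    "query": list(range(12, 18)),  # "how many books are there?"
    "generation": [18],            # last token, where the answer is produced
}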
danaarad.bsky.social
VLMs perform better on questions about text than when answering the same questions about images - but why? And how can we fix it?

In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective and use our findings to close a third of it! 🧵
Reposted by Dana Arad
blackboxnlp.bsky.social
Working on circuit discovery in LMs?
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!

The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
Reposted by Dana Arad
blackboxnlp.bsky.social
Have you heard about this year's shared task? 📢

Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
Reposted by Dana Arad
amuuueller.bsky.social
SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
danaarad.bsky.social
Thank you! Added to my reading list ☺️
danaarad.bsky.social
Should work now!
danaarad.bsky.social
SAEs have sparked a debate over their utility; we hope to add another perspective. Would love to hear your thoughts!

Paper: arxiv.org/abs/2505.20063
Code: github.com/technion-cs-...

Huge thanks to @boknilev.bsky.social and @amuuueller.bsky.social - it's been great working on this project with you!
SAEs Are Good for Steering -- If You Select the Right Features
Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output...
arxiv.org
danaarad.bsky.social
These findings have practical implications: after filtering out features with low output scores, we see 2-3x improvements for steering with SAEs, making them competitive with supervised methods on AxBench, a recent steering benchmark (Wu and @aryaman.io et al.)
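In spirit, the filtering could look like the sketch below: keep only SAE features whose output score clears a threshold, then steer with the surviving decoder directions (tensor shapes, the threshold, and the scoring dict are illustrative assumptions, not the paper's exact recipe):

import torch

def steer(hidden, sae_decoder, candidate_features, output_scores, alpha=4.0, thresh=0.5):
    # Drop features that merely detect the concept in the input but do
    # not causally drive the model's output toward it.
    keep = [f for f in candidate_features if output_scores[f] > thresh]
    if not keep:
        return hidden  # nothing reliable to steer with
    direction = sae_decoder[keep].mean(dim=0)  # decoder rows = feature directions
    return hidden + alpha * direction / direction.norm()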
danaarad.bsky.social
We show that high scores rarely co-occur, and emerge at different layers: features in earlier layers primarily detect input patterns, while features in later layers are more likely to drive the model’s outputs, consistent with prior analyses of LLM neuron functionality.
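Loosely, the two scores could be operationalized like this, assuming activation arrays (e.g. NumPy or torch); these are toy definitions in the spirit of the post, not the paper's exact metrics:

def input_score(act_on_concept, act_on_neutral):
    # Detection: does the feature fire more when the concept is present
    # in the input?
    return act_on_concept.mean() - act_on_neutral.mean()

def output_score(p_concept_steered, p_concept_base):
    # Causal effect: does amplifying the feature shift generations
    # toward the concept?
    return p_concept_steered - p_concept_base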