patqdasilva.bsky.social
@patqdasilva.bsky.social
Pinned
Super grateful to have received a Senior Area Chair highlight at #ACL2025NLP
⏳ The generalization of interpretability-based steering methods is at an inflection point
🚂 As a community, we need to place stronger emphasis on evaluating the reliability of methods if we care about long-term impact
🌟Excited to announce that “Steering off Course” was accepted to #ACL2025NLP for an Oral and Panel Discussion! arxiv.org/abs/2504.04635
📍Wed, 9AM, Level 2 Hall A

🍁I will also share this work at the Actionable Interpretability workshop @ActInterp at #ICML2025
📍Sat, 1PM, East Ballroom A
Steering language models by directly intervening on internal activations is appealing, but does it generalize?

We study 3 popular steering methods with 36 models from 14 families (1.5B–70B), exposing brittle performance and fundamental flaws in their underlying assumptions
🧵👇
(1/10)
We hope to inspire further research into the variance of internal transformer mechanisms, so that future steering methods are robust and adaptable to new model releases 🥕🐇💨

Special shoutout to my advisor @shocheen.bsky.social and collaborators @harisethuram.bsky.social, Dheeraj Rajagopal, Hanna Hajishirzi
(10/10)
We report many aggregated results in our paper and invite researchers to comb through the extensive per-model results in our repository to build intuition about model variance

Our paper: arxiv.org/abs/2504.04635
Code, data, results, and figures for all LMs: github.com/patqdasilva/steering-off-course
(9/10)
✨Localization hypothesis does not always hold for FVs✨

Simple word-pair ICL translation tasks from English into another language require more heads to be effective

Post-trained models are more steerable when more heads are used in the FV
(8/10)
✨Localization hypothesis does not always hold for FVs✨

FVs rely on the assumption that information required for ICL is stored and activated within a small subset of heads 🎯

But certain models require many heads in their FV before recovering performance 🙉🙉🙉➡️📈
(7/10)
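For readers new to FVs, a minimal sketch of the recipe (toy shapes and head indices are hypothetical, not from the paper): average the task-conditioned outputs of a chosen set of attention heads, project them into the residual stream, and add the result at one layer.

```python
import torch

# Toy, hypothetical dimensions; real models are far larger.
d_model, n_heads, d_head = 64, 8, 8

# Mean task-conditioned output of each attention head, averaged over
# ICL prompts (random stand-ins here), and per-head output projections.
mean_head_out = torch.randn(n_heads, d_head)
W_O = torch.randn(n_heads, d_head, d_model)

def build_fv(top_heads):
    """Sum the projected mean outputs of the selected heads.

    The localization hypothesis predicts a small set should suffice;
    some models turn out to need many heads before performance recovers.
    """
    fv = torch.zeros(d_model)
    for h in top_heads:
        fv = fv + mean_head_out[h] @ W_O[h]  # (d_head,) @ (d_head, d_model)
    return fv

# Steering: add the FV to the residual stream at one layer
# (the injection layer is a tuned hyperparameter in practice).
hidden = torch.randn(d_model)
steered = hidden + build_fv(top_heads=[0, 3, 5])
```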
🔔Neither Function Vectors nor Task Vectors are generalizable🔔

Several model families, even after significant hyperparameter tuning, show no improvement, or even a decline, in the relevant steering metrics🧰📉
(6/10)
🔔Neither Function Vectors nor Task Vectors are generalizable🔔

Even with the best-performing method, FVs with a full hyperparameter search, only 76% of model-task combinations recover 50% of 5-shot performance
(5/10)
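For concreteness, one plausible reading of the "recovers 50% of 5-shot performance" criterion (our shorthand; the exact definition is in the paper):

```python
def recovers_half(steered_acc: float, five_shot_acc: float) -> bool:
    """True if steered zero-shot accuracy reaches at least half of
    the same model's 5-shot ICL accuracy on the task."""
    return steered_acc >= 0.5 * five_shot_acc

# A task with 80% 5-shot accuracy sets the bar at 40%:
print(recovers_half(steered_acc=0.35, five_shot_acc=0.80))  # False
```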
DoLa’s poor efficacy could stem from the flawed assumption that the factual knowledge of an LM evolves gradually across layers

Both correct and incorrect answer tokens have low probabilities until they spike at the same layer, suggesting that contrasts with early layers are uninformative
(4/10)
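The diagnostic behind this claim is logit-lens style: decode each layer's hidden state with the final unembedding and track the answer token's probability. A minimal sketch with toy tensors (a real analysis uses the model's actual hidden states and applies the final LayerNorm first):

```python
import torch

# Toy stand-ins; real analyses use per-layer residual stream states
# and the model's actual unembedding matrix.
n_layers, d_model, vocab = 12, 64, 1000
answer_token = 42  # hypothetical token id

hidden_per_layer = torch.randn(n_layers, d_model)
W_U = torch.randn(d_model, vocab)

# Logit lens: project every intermediate layer straight to the vocab.
probs = torch.softmax(hidden_per_layer @ W_U, dim=-1)
for layer, p in enumerate(probs[:, answer_token]):
    print(f"layer {layer:2d}: p(answer) = {p.item():.4f}")
```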
DoLa contrasts token probabilities across layers to enhance factuality

✅Consistent with prior work, we find that DoLa works decently for Llama 1 on TruthfulQA and FACTOR

❌However, for all other models tested, the improvements afforded by DoLa on most metrics are negligible
(3/10)
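For context, DoLa's core contrast, sketched with toy distributions (the real method also selects the "premature" layer dynamically, by maximum divergence from the final layer):

```python
import torch

vocab = 1000  # toy vocabulary size

# Next-token log-probs from a late ("mature") and an early
# ("premature") layer, each read out via the logit lens.
log_p_mature = torch.log_softmax(torch.randn(vocab), dim=-1)
log_p_premature = torch.log_softmax(torch.randn(vocab), dim=-1)

# Adaptive plausibility constraint: only keep tokens the mature layer
# already finds plausible (within a factor alpha of its max).
alpha = 0.1
plausible = log_p_mature >= log_p_mature.max() + torch.log(torch.tensor(alpha))

# DoLa score: boost tokens whose probability grew between the layers.
scores = torch.where(plausible, log_p_mature - log_p_premature,
                     torch.tensor(float("-inf")))
next_token = scores.argmax().item()
```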
We examine steering methods inspired by the Logit Lens and Activation Patching, specifically:
1️⃣DoLa arxiv.org/abs/2309.03883
2️⃣Function Vectors arxiv.org/abs/2310.15213
3️⃣Task Vectors arxiv.org/abs/2310.15916
(2/10)
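To make the contrast between the methods concrete, here is a minimal task-vector-style sketch (a simplification of the recipe in arxiv.org/abs/2310.15916; the layer choice and shapes are hypothetical):

```python
import torch

d_model = 64  # hypothetical hidden width

# 1) Run the LM on an ICL prompt such as "hot -> cold, big -> small,
#    wet ->" and cache the last token's residual-stream state at a
#    chosen middle layer L (random stand-in here).
task_vector = torch.randn(d_model)

# 2) Run the zero-shot query "fast ->" and, at the same layer L,
#    replace the last token's hidden state with the cached vector,
#    so the task is carried by an activation instead of the prompt.
def patch(hidden_at_layer_L: torch.Tensor) -> torch.Tensor:
    return task_vector  # full replacement, per the task-vector recipe

steered_hidden = patch(torch.randn(d_model))
```

FVs instead add a vector distilled from attention-head outputs, and DoLa leaves activations untouched, contrasting layer-wise output distributions instead.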