Mor Geva
@megamor2.bsky.social
840 followers 76 following 29 posts
https://mega002.github.io
Reposted by Mor Geva
yoav.ml
🧠 To reason over text and track entities, we find that language models use three types of 'pointers'!

They were thought to rely only on a positional one—but when many entities appear, that system breaks down.

Our new paper shows what these pointers are and how they interact 👇
Reposted by Mor Geva
soheeyang.bsky.social
🚨 New Paper 🚨
How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
1/N 🧵
Reposted by Mor Geva
yoav.ml
New Paper Alert! Can we precisely erase conceptual knowledge from LLM parameters?
Most methods are shallow or coarse, or they overreach, adversely affecting related or general knowledge.

We introduce 🪝𝐏𝐈𝐒𝐂𝐄𝐒 — a general framework for Precise In-parameter Concept EraSure. 🧵 1/
Reposted by Mor Geva
mariusmosbach.bsky.social
Check out Benno's notes about our paper on the impact of interpretability 👇.

Also, we are organizing a workshop at #ICML2025 which is inspired by some of the questions discussed in the paper: actionable-interpretability.github.io
Reposted by Mor Geva
sarah-nlp.bsky.social
Have work on the actionable impact of interpretability findings? Consider submitting to our Actionable Interpretability workshop at ICML! See below for more info.

Website: actionable-interpretability.github.io
Deadline: May 9
megamor2.bsky.social
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!
megamor2.bsky.social
Forgot to tag the one and only @hadasorgad.bsky.social !!!
megamor2.bsky.social
Communication between LLM agents can be super noisy! One rogue agent can easily drag the whole system into failure 😱

We find that (1) it's possible to detect rogue agents early on
(2) interventions can boost system performance by up to 20%!

Thread with details and paper link below!
ohav.bsky.social
"One bad apple can spoil the bunch 🍎", and that's doubly true for language agents!
Our new paper shows how monitoring and intervention can prevent agents from going rogue, boosting performance by up to 20%. We're also releasing a new multi-agent environment 🕵️‍♂️
megamor2.bsky.social
In a final experiment, we show that output-centric methods can be used to "revive" features previously thought to be "dead" 🧟‍♂️ reviving hundreds of SAE features in Gemma 2! 6/
megamor2.bsky.social
Unsurprisingly, while activating inputs better describe what activates a feature, output-centric methods do much better at predicting how steering the feature will affect the model’s output!

But combining the two works best! 🚀 5/
megamor2.bsky.social
Next, we evaluate the widely used activating-inputs approach versus two output-centric methods:
- vocabulary projection (a.k.a. logit lens)
- tokens with max probability change in the output

Our output-centric methods require no more than a few inference passes! 4/
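
For intuition, here is a minimal sketch of the two output-centric methods, not the paper's code: it assumes GPT-2 as a stand-in model, a random placeholder for the feature's residual-stream direction, and an arbitrary layer and steering strength.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: GPT-2 as a stand-in model and a random placeholder direction for the
# feature; in practice the direction would come from an SAE decoder or a neuron's weights.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
feature_dir = torch.randn(model.config.hidden_size)

# (a) Vocabulary projection ("logit lens"): read the feature direction directly
# through the unembedding matrix and keep the top-scoring tokens.
W_U = model.get_output_embeddings().weight                  # (vocab, hidden)
proj = W_U @ feature_dir                                    # (vocab,)
print("vocab projection:", [tok.decode(i) for i in proj.topk(10).indices.tolist()])

# (b) Max probability change: add the direction to the residual stream at one layer
# and see which next-token probabilities increase the most (one extra forward pass).
layer, alpha = 6, 8.0                                       # illustrative layer / steering strength
inputs = tok("The", return_tensors="pt")

def steer(module, args, output):
    return (output[0] + alpha * feature_dir,) + output[1:]  # shift the block's hidden states

with torch.no_grad():
    base = model(**inputs).logits[0, -1].softmax(-1)
    handle = model.transformer.h[layer].register_forward_hook(steer)
    steered = model(**inputs).logits[0, -1].softmax(-1)
    handle.remove()

delta = steered - base
print("max prob change:", [tok.decode(i) for i in delta.topk(10).indices.tolist()])
```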
megamor2.bsky.social
To fix this, we first propose using both input- and output-based evaluations for feature descriptions.
Our output-based eval measures how well a description of a feature captures its effect on the model's generation. 3/
megamor2.bsky.social
Autointerp pipelines describe neurons and SAE features based on inputs that activate them.

This is problematic ⚠️
1. Collecting activations over large datasets is expensive, time-consuming, and often infeasible.
2. It overlooks how features affect model outputs!

2/
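
For contrast, the activating-inputs baseline looks roughly like the sketch below: run the model over a corpus, record the feature's activation on every token, and keep the top-activating snippets. The SAE encoder here is a random placeholder and the corpus is a toy list, both assumptions for illustration; the full pass over the data is what makes this approach expensive.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy setup (all assumptions): GPT-2 as the model, a random vector as the "SAE
# encoder" of a single feature, and a tiny in-memory corpus instead of a real dataset.
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
tok = AutoTokenizer.from_pretrained("gpt2")
layer = 6                                              # illustrative layer
w_enc = torch.randn(model.config.hidden_size)          # placeholder encoder weights for one feature
corpus = [
    "The Eiffel Tower is in Paris.",
    "Cats sleep for most of the day.",
    "Paris hosted the 1900 Summer Olympics.",
]

scored = []
with torch.no_grad():
    for text in corpus:                                # the expensive part: a full pass over the data
        ids = tok(text, return_tensors="pt")
        hidden = model(**ids).hidden_states[layer][0]  # (seq, hidden) residual stream at `layer`
        acts = torch.relu(hidden @ w_enc)              # the feature's activation on each token
        scored.append((acts.max().item(), text))

# The feature description is then written (by a human or an LLM) from the top-activating snippets.
print("top activating inputs:", sorted(scored, reverse=True)[:2])
```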
megamor2.bsky.social
How can we interpret LLM features at scale? 🤔

Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs!
We propose efficient output-centric methods that better predict the steering effect of a feature.

New preprint led by @yoav.ml 🧵1/
Reposted by Mor Geva
fbarez.bsky.social

🚨 New Paper Alert: Open Problem in Machine Unlearning for AI Safety 🚨

Can AI truly "forget"? While unlearning promises data removal, controlling emergent capabilities is an inherent challenge. Here's why it matters: 👇

Paper: arxiv.org/pdf/2501.04952
1/8
megamor2.bsky.social
Most operation descriptions are plausible based on human judgment.
We also observe interesting operations implemented by heads, like the extension of time periods (day → month → year) and the association of known figures with years relevant to their historical significance (9/10)
megamor2.bsky.social
Next, we establish an automatic pipeline that uses GPT-4o to annotate the salient mappings from MAPS.
We map the attention heads of Pythia 6.9B and GPT2-xl and manage to identify operations for most heads, covering 60%-96% of heads in the middle and upper layers (8/10)
megamor2.bsky.social
(3) Smaller models tend to encode more relations in a single head

(4) In Llama-3.1 models, which use grouped-query attention, grouped heads often implement the same or similar relations (7/10)
megamor2.bsky.social
(1) Different models encode certain relations across attention heads to similar degrees

(2) Different heads implement the same relation to varying degrees, which has implications for localization and editing of LLMs (6/10)
megamor2.bsky.social
Using MAPS, we study the distribution of operations across heads in different models -- Llama, Pythia, Phi, GPT2 -- and see some cool trends of function encoding universality and architecture biases: (5/10)
megamor2.bsky.social
Experiments on 20 operations and 6 LLMs show that MAPS estimates strongly correlate with the heads' outputs during inference.

Ablating heads that implement an operation damages the model's ability to perform tasks requiring that operation more than removing other heads does (4/10)
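
A rough sketch of what such an ablation check can look like, under stated assumptions (GPT-2 as a stand-in model, a single toy prompt instead of the paper's tasks, and hypothetical head indices), using the head_mask mechanism in Hugging Face transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: GPT-2 as a stand-in model, one toy prompt instead of the paper's
# evaluation tasks, and made-up (layer, head) indices for illustration.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

def answer_logprob(prompt, answer, ablate=()):
    """Log-prob of `answer` given `prompt`, with the given (layer, head) pairs zeroed out."""
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in ablate:
        head_mask[layer, head] = 0.0                            # knock out this attention head
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids, head_mask=head_mask).logits
    logprobs = logits.log_softmax(-1)[0, prompt_len - 1 : -1]   # positions predicting the answer
    answer_ids = ids[0, prompt_len:]
    return logprobs.gather(-1, answer_ids[:, None]).sum().item()

prompt, answer = "The capital of France is", " Paris"
print("intact model:            ", answer_logprob(prompt, answer))
print("ablate 'operation' heads:", answer_logprob(prompt, answer, [(7, 2), (9, 5)]))   # hypothetical heads
print("ablate control heads:    ", answer_logprob(prompt, answer, [(3, 0), (11, 8)]))  # hypothetical heads
```

In this setup, a larger drop for the "operation" heads than for the control heads would mirror the ablation effect reported in the thread.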