Yoav Gur Arieh
@yoav.ml
Pinned
yoav.ml
🧠 To reason over text and track entities, we find that language models use three types of 'pointers'!

They were thought to rely only on a positional one—but when many entities appear, that system breaks down.

Our new paper shows what these pointers are and how they interact 👇
yoav.ml
Overall, we show that LMs retrieve entities not through a single positional mechanism, but through a mixture of three: positional, lexical, and reflexive.

Understanding these mechanisms helps explain both the strengths and limits of LLMs, and how they reason in context. 8/
yoav.ml
Finally, we evaluate our model on more natural and increasingly long tasks, showing that the ‘lost-in-the-middle’ effect might be explained mechanistically by a weakening lexical signal alongside an increasingly noisy positional one. 7/
yoav.ml
We leverage these insights to build a causal model combining all three mechanisms, predicting next-token distributions with 95% agreement.

We model the positional term as a Gaussian with shifting std, and the other two as one-hot distributions with position-based weights. 6/
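For intuition, here is a toy sketch of such a mixture in Python: a Gaussian positional term over candidate entity slots, plus one-hot lexical and reflexive terms, combined with position-based weights. The weights and std schedule below are illustrative placeholders, not the paper's fitted values.

```python
# Toy sketch of the three-mechanism mixture described above (illustrative only).
import numpy as np

def positional_term(target_idx, n_entities, std):
    """Gaussian over entity slots, centred on the queried position."""
    idx = np.arange(n_entities)
    p = np.exp(-0.5 * ((idx - target_idx) / std) ** 2)
    return p / p.sum()

def one_hot(correct_idx, n_entities):
    """Lexical and reflexive terms: all mass on the bound entity."""
    p = np.zeros(n_entities)
    p[correct_idx] = 1.0
    return p

def predict(target_idx, correct_idx, n_entities, depth):
    # Placeholder position-based weights: the positional term gets noisier
    # (larger std) the deeper the queried entity sits in the context.
    std = 0.5 + 0.1 * depth
    w_pos, w_lex, w_refl = 0.4, 0.3, 0.3
    mix = (w_pos * positional_term(target_idx, n_entities, std)
           + w_lex * one_hot(correct_idx, n_entities)
           + w_refl * one_hot(correct_idx, n_entities))
    return mix / mix.sum()

print(predict(target_idx=2, correct_idx=2, n_entities=5, depth=2))
```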
yoav.ml
We show this through extensive use of interchange interventions, evaluating across 10 binding tasks and 9 models (Gemma/Qwen/Llama, 2B-72B params).

Across all models, we find a remarkably consistent reliance on these three mechanisms, and consistent patterns in how they interact. 5/
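An interchange intervention amounts to caching an activation from a "source" run and splicing it into a "base" run, then checking how the prediction changes. Here is a minimal sketch with HuggingFace transformers, assuming GPT-2 as a stand-in model; the layer index, token position, and prompts are illustrative, not the paper's setup.

```python
# Minimal interchange-intervention (activation-patching) sketch, illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates Gemma/Qwen/Llama models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

base   = "Holly loves Michael, Jim loves Pam. Who loves Michael?"
source = "Angela loves Michael, Jim loves Pam. Who loves Michael?"

layer, pos = 6, -1  # illustrative intervention site: block index and token position

# 1) Cache the source run's residual state at the chosen site.
with torch.no_grad():
    out = model(**tok(source, return_tensors="pt"), output_hidden_states=True)
src_hidden = out.hidden_states[layer + 1][0, pos]  # output of block `layer`

# 2) Re-run the base prompt, swapping that state in via a forward hook.
def patch(module, inputs, output):
    output[0][0, pos] = src_hidden
    return output

handle = model.transformer.h[layer].register_forward_hook(patch)
with torch.no_grad():
    logits = model(**tok(base, return_tensors="pt")).logits[0, -1]
handle.remove()

# If the prediction shifts from "Holly" toward "Angela", the patched site
# carried the binding information being tested.
print(tok.decode(logits.argmax()))
```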
yoav.ml
Then we have the *reflexive* mechanism, which retrieves exactly the token "Holly".

This happens through a self-referential pointer originating from the "Holly" token and pointing back to it. This pointer gets copied to the "Michael" token, binding the two entities together. 4/
yoav.ml
To compensate for this, LMs use two additional mechanisms.

The first is *lexical*, where the LM retrieves the subject next to "Michael". It does this by copying the lexical contents of "Holly" to "Michael", binding them together. 3/
yoav.ml
Prior work identified only a positional mechanism, where the model tracks entities by position: here, retrieving "Holly", the subject of the first clause.

We show this isn’t sufficient—the positional signal is strong at the edges of context but weak and diffuse in the middle. 2/
yoav.ml
A key part of in-context reasoning is the ability to bind entities for tracking and retrieval.

When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”

We show that this binding relies on three mechanisms. 1/
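As a concrete probe, a minimal version of this kind of query looks like the snippet below, assuming a small HuggingFace causal LM as an illustrative stand-in for the models studied in the paper.

```python
# Minimal binding probe (illustrative model choice; not the paper's setup).
from transformers import pipeline

lm = pipeline("text-generation", model="gpt2")
prompt = "Holly loves Michael, Jim loves Pam. Who loves Michael? Answer:"
print(lm(prompt, max_new_tokens=3)[0]["generated_text"])
# A model that has bound Holly and Michael should continue with "Holly".
```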
yoav.ml
This is a step toward targeted, interpretable, and robust knowledge removal — at the parameter level.

Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
yoav.ml
We also check robustness to relearning: can the model relearn the erased concept from data that is related to, but non-overlapping with, the eval questions?

🪝𝐏𝐈𝐒𝐂𝐄𝐒 resists relearning far better than prior methods, while others often fully recover the concept! 5/
yoav.ml
Our specificity evaluation includes similar-domain accuracy, a stricter test than prior work uses, where 🪝𝐏𝐈𝐒𝐂𝐄𝐒 outperforms all other methods.

You can erase “Harry Potter” and still do fine on Lord of the Rings and Star Wars! 4/
yoav.ml
We show that 🪝𝐏𝐈𝐒𝐂𝐄𝐒:
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
yoav.ml
🪝𝐏𝐈𝐒𝐂𝐄𝐒 works by:
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights

No need for fine-tuning, retain sets, or enumerating facts. 2/
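One way to picture steps 2-3 is projecting the selected SAE decoder directions out of the relevant weight matrices. A heavily simplified sketch, assuming a pretrained per-layer SAE with decoder directions `sae.W_dec` and precomputed per-feature concept scores; the shapes, scoring, and threshold are assumptions for illustration, not the paper's exact procedure (see the repo linked above for that):

```python
# Heavily simplified sketch of in-parameter concept ablation (illustrative only).
import torch

def erase_concept(weight, sae, concept_scores, threshold=0.9):
    """Project concept-encoding SAE features out of a weight matrix.

    weight:         e.g. an MLP output matrix, shape (d_model, d_hidden)  [assumed]
    sae.W_dec:      SAE decoder directions, shape (n_features, d_model)   [assumed]
    concept_scores: per-feature relevance to the target concept, (n_features,)
    """
    selected = sae.W_dec[concept_scores > threshold]   # features to ablate
    if selected.numel() == 0:
        return weight
    # Orthonormalise the selected directions and remove their component
    # from the weight's output space, i.e. reconstruct the weights without them.
    q, _ = torch.linalg.qr(selected.T)                 # (d_model, k)
    return weight - q @ (q.T @ weight)
```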
yoav.ml
Large language models excel at storing knowledge, but not all of it is safe or useful; e.g. chatbots for kids shouldn’t discuss guns or gambling. How can we selectively remove inappropriate conceptual knowledge while preserving utility?

Meet our method 🪝𝐏𝐈𝐒𝐂𝐄𝐒!
yoav.ml
New Paper Alert! Can we precisely erase conceptual knowledge from LLM parameters?
Most methods are shallow or coarse, or they overreach, adversely affecting related or general knowledge.

We introduce 🪝𝐏𝐈𝐒𝐂𝐄𝐒 — a general framework for Precise In-parameter Concept EraSure. 🧵 1/
Reposted by Yoav Gur Arieh
megamor2.bsky.social
How can we interpret LLM features at scale? 🤔

Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs!
We propose efficient output-centric methods that better predict the steering effect of a feature.

New preprint led by @yoav.ml 🧵1/
Reposted by Yoav Gur Arieh
megamor2.bsky.social
What's in an attention head? 🤯

We present an efficient framework – MAPS – for inferring the functionality of attention heads in LLMs ✨directly from their parameters✨

A new preprint with Amit Elhelo 🧵 (1/10)