@tomerashuach.bsky.social
Reposted
danaarad.bsky.social
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
tomerashuach.bsky.social
🚀 Impact: As LMs become ubiquitous, protecting privacy while maintaining utility is crucial.
REVS offers a practical solution for post-hoc removal of sensitive information.

📄Paper: technion-cs-nlp.github.io/REVS/
👨‍💻Code: github.com/Tomertech/REVS

#Unlearning #NLProc #ACL2025NLP
8/8
tomerashuach.bsky.social
🛡️ Extraction Resistance: REVS is more robust than the baselines against sophisticated attacks:
- Logit-lens attacks
- Delta attacks
- Perturbation attacks
Critical for real-world deployment where adversaries actively try to extract "unlearned" info.
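A logit-lens attack decodes intermediate hidden states directly through the unembedding matrix to check whether a supposedly unlearned token still surfaces inside the network. A minimal numpy sketch of the idea, with toy random weights and a hypothetical token id (not the paper's models or code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions): hidden size 16, vocabulary of 100 tokens.
d_model, vocab_size = 16, 100
W_U = rng.normal(size=(d_model, vocab_size))   # unembedding matrix
hidden = rng.normal(size=d_model)              # residual state at some middle layer
sensitive_token = 42                           # hypothetical id of an "unlearned" token

def logit_lens_rank(hidden, W_U, token_id):
    """Decode an intermediate hidden state straight through the unembedding
    and return the token's rank there (0 = top prediction)."""
    logits = hidden @ W_U
    return int((logits > logits[token_id]).sum())

rank = logit_lens_rank(hidden, W_U, sensitive_token)
print(f"logit-lens rank of sensitive token: {rank}")
```

An unlearning method resists this attack only if the sensitive token's rank stays far from the top at every layer, not just at the final output.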
7/8
tomerashuach.bsky.social
🏆 Results: REVS outperforms 6 strong baselines across all metrics:
✅ Superior unlearning effectiveness
✅ Better model integrity preservation
✅ Stronger resistance to extraction attacks
✅ Robust across different hyperparameters
6/8
tomerashuach.bsky.social
📊 Evaluation: We curated 3 datasets with actual sensitive information:
- Emails & URLs naturally memorized by Llama-3-8B & GPT-J-6B
- A synthetic SSN dataset where we induced memorization
Real sensitive data = real evaluation!
5/8
tomerashuach.bsky.social
🔬How REVS Works:
1. Localization: Find layers & neurons most responsible for generating target tokens
2. Editing: Modify neurons in vocabulary space to demote sensitive tokens
3. Preservation: Keep general model knowledge intact
All without gradients!
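The editing step above can be caricatured in a few lines of numpy. The sketch below uses toy random weights, with `v` standing in for one neuron's output-weight vector; it demotes a token by deleting `v`'s component along that token's unembedding direction, which zeroes the neuron's contribution to that logit. This is a simplification of REVS's rank-based editing, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab_size = 16, 100
W_U = rng.normal(size=(d_model, vocab_size))   # unembedding matrix (toy)
v = rng.normal(size=d_model)                   # one neuron's output-weight vector
target = 7                                     # hypothetical sensitive-token id

def demote_token(v, W_U, token_id):
    """Remove v's component along the token's unembedding column, so this
    neuron no longer promotes that token in vocabulary space."""
    u = W_U[:, token_id]
    return v - (v @ u) / (u @ u) * u

v_edited = demote_token(v, W_U, target)
# The edited neuron's contribution to the target logit is now (numerically) zero:
print(v_edited @ W_U[:, target])
```

Note that the edit is a closed-form projection, so no gradients are involved, matching the thread's "without gradients" claim in spirit.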
4/8
tomerashuach.bsky.social
💡Our Solution - REVS: A novel non-gradient method that surgically removes sensitive info while preserving model capabilities.
Key insight: We identify neurons that promote sensitive tokens in vocabulary space and modify them to demote those tokens to lower ranks.
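The localization half of that insight can be sketched as scoring each MLP neuron by how strongly its write direction promotes the sensitive token in vocabulary space. Toy random weights; all names and dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n_neurons, d_model, vocab_size = 32, 16, 100
W_out = rng.normal(size=(n_neurons, d_model))  # MLP output weights: row i is neuron i's write direction
W_U = rng.normal(size=(d_model, vocab_size))   # unembedding matrix
target = 5                                     # hypothetical sensitive-token id

# Neuron i's contribution to the target token's logit (per unit activation)
# is its write direction dotted with that token's unembedding column.
scores = W_out @ W_U[:, target]
top_neurons = np.argsort(scores)[::-1][:5]     # the 5 strongest promoters
print(top_neurons)
```

Neurons ranked this way are the natural candidates for the demotion edit described in the thread.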
3/8
tomerashuach.bsky.social
🔎 The Problem:
LMs can regurgitate private info from their training data.
Prompt: "Contact David Lewis at" → "[email protected]"
This violates privacy regulations like GDPR and poses serious security risks.
2/8
tomerashuach.bsky.social
🚨New paper at #ACL2025 Findings!
REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space.
LMs memorize and leak sensitive data (emails, SSNs, URLs) from their training data.
We propose a surgical method to unlearn it.
🧵👇w/ @boknilev.bsky.social @mtutek.bsky.social
1/8