Jakub Łucki
@jakublucki.bsky.social
24 followers 38 following 11 posts
Visiting Researcher at NASA JPL | Data Science MSc at ETH Zurich
Pinned
jakublucki.bsky.social
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇
jakublucki.bsky.social
Just arrived in Vancouver for #NeurIPS! If you’d like to chat about cutting-edge research, let me know! I’ve always been curious about far too many things (for my own good), so all topics are welcome.

If you can’t catch me during the week, stop by our poster on the weekend or join the presentation!
jakublucki.bsky.social
Our paper on how unlearning fails to remove hazardous knowledge from LLM weights received the 🏆 Best Paper 🏆 award at SoLaR @ NeurIPS!

Join my oral presentation on Saturday at 4:30 pm to learn more.
jakublucki.bsky.social
Our findings highlight that

1️⃣ Robust unlearning is not yet possible; current methods face challenges similar to those of safety training.

2️⃣ Black-box evaluations can be misleading when assessing the effectiveness of unlearning.
jakublucki.bsky.social
Fine-tuning “unlearned” models on benign datasets can completely restore hazardous knowledge.

Fine-tuning on dangerous knowledge leads to disproportionately fast recovery of hazardous capabilities (10 samples -> >60% of capabilities regained).
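A minimal sketch of the benign fine-tuning attack described above: a handful of standard causal-LM gradient steps on harmless text. The checkpoint name and the benign corpus are placeholders, not the paper's exact setup.

```python
# Sketch: fine-tuning an "unlearned" model on benign text can restore
# the knowledge that was supposedly removed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/rmu-unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

benign_texts = [  # any harmless corpus works; these are toy examples
    "Photosynthesis converts light energy into chemical energy.",
    "The Pacific Ocean is the largest ocean on Earth.",
]

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
for text in benign_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Re-evaluating on a hazardous-knowledge benchmark (e.g. WMDP) after a few
# such steps is how recovery of the "removed" knowledge can be measured.
```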
jakublucki.bsky.social
🔡 GCG can be adapted to generate universal adversarial prefixes.

↗️ Similar to safety training, unlearning relies on specific directions in the residual stream that can be ablated (sketch below).

✂️ We can prune neurons responsible for “obfuscating” dangerous knowledge.
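For the residual-stream point above, a rough sketch of directional ablation in the spirit of refusal-direction removal: estimate an "obfuscation" direction from activation differences between prompt sets and project it out of the hidden states at inference time. The checkpoint, prompts, and layer index are illustrative placeholders, and the hook assumes a Llama/Mistral-style `model.model.layers` layout; this is not the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/rmu-unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer_idx = 12  # residual-stream layer to read from / intervene on (illustrative)

def mean_resid(prompts):
    # Mean residual-stream activation at the chosen layer, last token position.
    acts = []
    for p in prompts:
        with torch.no_grad():
            out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(acts).mean(0)

hazardous_prompts = ["Placeholder hazardous-topic question."]  # placeholder sets
benign_prompts = ["How do I bake sourdough bread?"]
direction = mean_resid(hazardous_prompts) - mean_resid(benign_prompts)
direction = direction / direction.norm()

def ablate(module, inputs, output):
    # Remove the component of the hidden states along `direction`.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - (hidden @ direction)[..., None] * direction
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Apply the ablation hook from the chosen layer onward.
for layer in model.model.layers[layer_idx:]:
    layer.register_forward_hook(ablate)
```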
jakublucki.bsky.social
How did we check this?

We adapted several white-box attacks used to jailbreak safety-trained models and applied them to two prominent unlearning methods: RMU and NPO.
jakublucki.bsky.social
Safety training fine-tunes models to refuse harmful requests but can be easily jailbroken.

Machine unlearning was introduced to fully erase hazardous knowledge, making it inaccessible to adversaries.

Sounds amazing, right? Well, existing methods cannot do this (yet).