Daniel Khashabi
@danielkhashabi.bsky.social
17 followers 25 following 74 posts
I play with intuitions and data. Now: @jhuclsp @jhucompsci Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
danielkhashabi.bsky.social
𝗦𝗲𝗲 𝘁𝗵𝗲 𝗱𝗲𝘁𝗮𝗶𝗹𝘀 𝗼𝗳 𝘁𝗵𝗲 𝗳𝗶𝗻𝗱𝗶𝗻𝗴𝘀: huggingface.co/papers/2509...

Work led by @aamixsh, in collaboration with @anqi_liu33.
@HopkinsEngineer @JHUCompSci

x.com/aamixsh/sta...
Paper page - IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
huggingface.co
danielkhashabi.bsky.social
For 2️⃣, we introduce 𝑨𝒄𝒕𝒊𝒗𝒂𝒕𝒊𝒐𝒏 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕 (𝑰𝑨𝟐) -- a method that 𝘥𝘪𝘴𝘵𝘪𝘭𝘭𝘴 𝘐𝘊𝘓 𝘢𝘤𝘵𝘪𝘷𝘢𝘵𝘪𝘰𝘯𝘴 𝘪𝘯𝘵𝘰 𝘵𝘩𝘦 𝘱𝘢𝘳𝘢𝘮𝘦𝘵𝘦𝘳𝘴 𝘰𝘧 𝘢 𝘱𝘳𝘦-𝘵𝘳𝘢𝘪𝘯𝘦𝘥 𝘮𝘰𝘥𝘦𝘭. Then, running SFT on top of this "primed" model leads to consistent gains over vanilla SFT.
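For a sense of how such activation distillation could work, here is a minimal sketch in PyTorch/transformers style; the layer choice, last-token alignment, and MSE objective are assumptions for illustration, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: "prime" a pre-trained model by pulling its activations
# on a plain query toward the activations a frozen copy of the model produces
# when the same query is preceded by in-context demonstrations.

def ia2_priming_step(model, frozen_model, plain_batch, icl_batch, optimizer, layer=-1):
    with torch.no_grad():
        # Target: hidden state of the query's last token under the ICL prompt.
        icl_hidden = frozen_model(**icl_batch, output_hidden_states=True).hidden_states[layer]
        target = icl_hidden[:, -1, :]

    out = model(**plain_batch, output_hidden_states=True)
    student = out.hidden_states[layer][:, -1, :]

    loss = F.mse_loss(student, target)  # distill ICL activations into the weights
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Standard SFT is then run on top of the primed model.
```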
danielkhashabi.bsky.social
On 1️⃣, building on prior findings, we find that ICL and SFT trigger distinct ⚡activation⚡ patterns -- an additional signal that ICL and SFT operate differently. We also find that ICL is generally more calibrated than SFT, though sometimes at the cost of accuracy.
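Calibration here can be quantified with a standard metric such as expected calibration error (ECE); a generic sketch (the paper's exact metric may differ):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin, weighted by the
    fraction of examples in the bin. Lower means better calibrated."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# e.g., compare expected_calibration_error(p_icl, y_icl) against
#       expected_calibration_error(p_sft, y_sft) on the same test set.
```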
danielkhashabi.bsky.social
Our latest work asks two questions:
1️⃣ Do ICL and SFT operate differently?
2️⃣ And if so, can one 𝗹𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝘁𝗵𝗲𝗶𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝗿𝗶𝘁𝘆 𝗳𝗼𝗿 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻?
danielkhashabi.bsky.social
ICL and SFT are the two most studied ways to adapt LMs. We understand each of them in isolation, but we know far less about how they might 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗼𝗻𝗲 𝗮𝗻𝗼𝘁𝗵𝗲𝗿.
danielkhashabi.bsky.social
👉 The overall takeaway: LLM agents today are brittle in open-world environments. For real-world deployment, we need robust strategies for fallback planning and recovery.
danielkhashabi.bsky.social
(3) More tools = harder recovery. As the toolset grows, fallback planning becomes less reliable, not more.
danielkhashabi.bsky.social
(1) LLM agents struggle to recover. Even frontier models show large performance drops when tools fail.

(2) RAG on tool schemas doesn’t solve it. Across models, we observe a significant accuracy gap between adversarial and non-adversarial settings.
danielkhashabi.bsky.social
Tool failures happen in practice: APIs break, schemas change, endpoints go offline. The key question we ask is: how does your LLM-based agent recover by exploring alternative solutions?

From our analysis in a controlled environment, we find:
danielkhashabi.bsky.social
Imagine this: excited about the recent progress, you’ve built an agentic system that uses 🔧tools (API calls) to solve complex problems. What could go wrong?

We studied agentic tool recovery—when your LLM selects a set of tools to execute, but one turns out to be unavailable or incorrect.
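As a toy illustration of the setting (hypothetical tool names, not the benchmark's actual harness), a failure-and-fallback loop might look like:

```python
# Toy sketch: the agent has a ranked list of candidate tools; we "break" one
# (simulating a dead API or changed schema) and check whether an alternative
# still completes the task.

def call(name, registry, broken, query):
    if name in broken:                       # simulated outage
        raise RuntimeError(f"tool {name!r} unavailable")
    return registry[name](query)

def solve_with_recovery(query, ranked_tools, registry, broken):
    """Try tools in ranked order; fall back to the next one when a call fails."""
    for name in ranked_tools:
        try:
            return call(name, registry, broken, query)
        except RuntimeError:
            continue                         # recovery = exploring an alternative tool
    return None                              # nothing worked: unrecoverable failure

registry = {
    "web_search": lambda q: f"search results for {q!r}",
    "wiki_lookup": lambda q: f"wiki article on {q!r}",
}
print(solve_with_recovery("Johns Hopkins", ["web_search", "wiki_lookup"],
                          registry, broken={"web_search"}))
```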
danielkhashabi.bsky.social
Jack led this work during his time at @MSFTResearch, working with
@aagohary, @ASMIftekhar1, and others.
danielkhashabi.bsky.social
Notably, we achieved >60% ASR (attack success rate) on OpenAI o1!
danielkhashabi.bsky.social
But how do we measure benchmark effectiveness? A key premise is that the effectiveness of attack prompts on dev models predicts their effectiveness on unseen eval models. Jack verifies that this is indeed the case: the resulting benchmark, JBDistill-Bench, remains effective on *unseen* models.
danielkhashabi.bsky.social
At a high level, this framework is a generate-then-select pipeline that "distills" effective jailbreak attacks into safety benchmarks, keeping eval results reproducible and robust to benchmark saturation & contamination.
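In spirit, such a generate-then-select pipeline could be sketched as follows; the `generate` and `judge` interfaces here are assumptions for illustration, not the released code:

```python
# Hypothetical sketch: pool candidate attack prompts, score each by attack
# success rate (ASR) on a set of development models, and keep the top-k as
# the refreshed ("distilled") benchmark.

def attack_success_rate(prompt, dev_models, judge):
    """Fraction of dev models whose response to `prompt` is judged unsafe."""
    responses = [m.generate(prompt) for m in dev_models]
    return sum(judge(prompt, r) for r in responses) / len(dev_models)

def distill_benchmark(candidate_prompts, dev_models, judge, k=100):
    scored = sorted(candidate_prompts,
                    key=lambda p: attack_success_rate(p, dev_models, judge),
                    reverse=True)
    return scored[:k]   # the distilled, renewable benchmark
```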
danielkhashabi.bsky.social
So, what's the future of AI safety benchmarks? Jack's solution is "renewable benchmarks," which allow us to refresh and expand benchmarks with a single click!
x.com/jackjingyuz...
danielkhashabi.bsky.social
A core hurdle in AI safety eval is that benchmarks (e.g., those on jailbreak attacks) become outdated shortly after release: they saturate, get contaminated, or get patched.