Daniel Khashabi
@danielkhashabi.bsky.social
17 followers 25 following 74 posts
I play with intuitions and data. Now: @jhuclsp @jhucompsci Past: @allen_ai @uwnlp @Penn @cogcomp @Illinois_Alma @MSFTResearch
danielkhashabi.bsky.social
𝗦𝗲𝗲 𝘁𝗵𝗲 𝗱𝗲𝘁𝗮𝗶𝗹𝘀 𝗼𝗳 𝘁𝗵𝗲 𝗳𝗶𝗻𝗱𝗶𝗻𝗴𝘀: huggingface.co/papers/2509...

Work led by @aamixsh, in collaboration with @anqi_liu33.
@HopkinsEngineer @JHUCompSci

x.com/aamixsh/sta...
Paper page - IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
huggingface.co
danielkhashabi.bsky.social
For 2️⃣, we introduce 𝑨𝒄𝒕𝒊𝒗𝒂𝒕𝒊𝒐𝒏 𝑨𝒍𝒊𝒈𝒏𝒎𝒆𝒏𝒕 (𝑰𝑨𝟐) -- a method that 𝘥𝘪𝘴𝘵𝘪𝘭𝘭𝘴 𝘐𝘊𝘓 𝘢𝘤𝘵𝘪𝘷𝘢𝘵𝘪𝘰𝘯𝘴 𝘪𝘯𝘵𝘰 𝘵𝘩𝘦 𝘱𝘢𝘳𝘢𝘮𝘦𝘵𝘦𝘳𝘴 𝘰𝘧 𝘢 𝘱𝘳𝘦-𝘵𝘳𝘢𝘪𝘯𝘦𝘥 𝘮𝘰𝘥𝘦𝘭. Then, running SFT on top of this "primed" model leads to consistent gains over vanilla SFT.
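For a sense of how such activation distillation could work, here is a minimal sketch in PyTorch/transformers style; the layer choice, last-token alignment, and MSE objective are assumptions for illustration, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: "prime" a pre-trained model by pulling its activations
# on a plain query toward the activations a frozen copy of the model produces
# when the same query is preceded by in-context demonstrations.

def ia2_priming_step(model, frozen_model, plain_batch, icl_batch, optimizer, layer=-1):
    with torch.no_grad():
        # Target: hidden state of the query's last token under the ICL prompt.
        icl_hidden = frozen_model(**icl_batch, output_hidden_states=True).hidden_states[layer]
        target = icl_hidden[:, -1, :]

    out = model(**plain_batch, output_hidden_states=True)
    student = out.hidden_states[layer][:, -1, :]

    loss = F.mse_loss(student, target)  # distill ICL activations into the weights
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Standard SFT is then run on top of the primed model.
```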
danielkhashabi.bsky.social
On 1️⃣, building on prior findings, we find that ICL and SFT trigger distinct ⚡activation⚡ patterns -- an additional signal that ICL and SFT operate differently. We also find that ICL is generally more calibrated than SFT, though sometimes at the cost of accuracy.
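Calibration here can be quantified with a standard metric such as expected calibration error (ECE); a generic sketch (the paper's exact metric may differ):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin, weighted by the
    fraction of examples in the bin. Lower means better calibrated."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# e.g., compare expected_calibration_error(p_icl, y_icl) against
#       expected_calibration_error(p_sft, y_sft) on the same test set.
```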
danielkhashabi.bsky.social
Our latest work asks two questions:
1️⃣ Do ICL and SFT operate differently?
2️⃣ And if so, can one 𝗹𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝘁𝗵𝗲𝗶𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝗿𝗶𝘁𝘆 𝗳𝗼𝗿 𝗯𝗲𝘁𝘁𝗲𝗿 𝗮𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻?
danielkhashabi.bsky.social
ICL and SFT are the two most studied ways to adapt LMs. We understand each of them in isolation, but we know far less about how they might 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗼𝗻𝗲 𝗮𝗻𝗼𝘁𝗵𝗲𝗿.
danielkhashabi.bsky.social
👉 The overall takeaway: LLM agents today are brittle in open-world environments. For real-world deployment, we need robust strategies for fallback planning and recovery.
danielkhashabi.bsky.social
(3) More tools = harder recovery. As the toolset grows, fallback planning becomes less reliable, not more.
danielkhashabi.bsky.social
(1) LLM agents struggle to recover. Even frontier models show large performance drops when tools fail.

(2) RAG on tool schemas doesn’t solve it. Across models, we observe a significant accuracy gap between adversarial and non-adversarial settings.
danielkhashabi.bsky.social
Tool failures happen in practice: APIs break, schemas change, endpoints go offline. The key question we ask is: how does your LLM-based agent recover by exploring alternative solutions?

From our analysis in a controlled environment, we find:
danielkhashabi.bsky.social
Imagine this: excited about the recent progress, you’ve built an agentic system that uses 🔧tools (API calls) to solve complex problems. What could go wrong?

We studied agentic tool recovery—when your LLM selects a set of tools to execute, but one turns out to be unavailable or incorrect.
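As a toy illustration of the setting (hypothetical tool names, not the benchmark's actual harness), a failure-and-fallback loop might look like:

```python
# Toy sketch: the agent has a ranked list of candidate tools; we "break" one
# (simulating a dead API or changed schema) and check whether an alternative
# still completes the task.

def call(name, registry, broken, query):
    if name in broken:                       # simulated outage
        raise RuntimeError(f"tool {name!r} unavailable")
    return registry[name](query)

def solve_with_recovery(query, ranked_tools, registry, broken):
    """Try tools in ranked order; fall back to the next one when a call fails."""
    for name in ranked_tools:
        try:
            return call(name, registry, broken, query)
        except RuntimeError:
            continue                         # recovery = exploring an alternative tool
    return None                              # nothing worked: unrecoverable failure

registry = {
    "web_search": lambda q: f"search results for {q!r}",
    "wiki_lookup": lambda q: f"wiki article on {q!r}",
}
print(solve_with_recovery("Johns Hopkins", ["web_search", "wiki_lookup"],
                          registry, broken={"web_search"}))
```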
danielkhashabi.bsky.social
Jack led this work during his time at @MSFTResearch, working with
@aagohary, @ASMIftekhar1, and others.
danielkhashabi.bsky.social
Notably, we achieved >60% ASR (attack success rate) on OpenAI o1!
danielkhashabi.bsky.social
But how do we measure benchmark effectiveness? A key premise is that the effectiveness of attack prompts on dev models predicts their effectiveness on unseen eval models. Jack verifies that this is indeed the case: the resulting benchmark, JBDistill-Bench, remains effective on *unseen* models.
danielkhashabi.bsky.social
At a high level, this framework is a generate-then-select pipeline that "distills" effective jailbreak attacks into safety benchmarks, keeping eval results reproducible and robust to benchmark saturation & contamination.
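In spirit, such a generate-then-select pipeline could be sketched as follows; the `generate` and `judge` interfaces here are assumptions for illustration, not the released code:

```python
# Hypothetical sketch: pool candidate attack prompts, score each by attack
# success rate (ASR) on a set of development models, and keep the top-k as
# the refreshed ("distilled") benchmark.

def attack_success_rate(prompt, dev_models, judge):
    """Fraction of dev models whose response to `prompt` is judged unsafe."""
    responses = [m.generate(prompt) for m in dev_models]
    return sum(judge(prompt, r) for r in responses) / len(dev_models)

def distill_benchmark(candidate_prompts, dev_models, judge, k=100):
    scored = sorted(candidate_prompts,
                    key=lambda p: attack_success_rate(p, dev_models, judge),
                    reverse=True)
    return scored[:k]   # the distilled, renewable benchmark
```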
danielkhashabi.bsky.social
So, what's the future of AI safety benchmarks? Jack's solution is "renewable benchmarks," which allow us to refresh and expand benchmarks with a single click!
x.com/jackjingyuz...
danielkhashabi.bsky.social
A core hurdle in AI safety eval is that benchmarks (e.g., those on jailbreak attacks) become outdated shortly after release: they saturate, get contaminated, or get patched.