John (Yueh-Han) Chen
@johnchen6.bsky.social
17 followers 97 following 18 posts
Graduate Student Researcher @nyu prev @ucberkeley https://john-chen.cc
Pinned
johnchen6.bsky.social
Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of the facts evaluated
- o1 and o3-mini pass <45%! 🧵
Reposted by John (Yueh-Han) Chen
maksym-andr.bsky.social
🚨Excited to release OS-Harm! 🚨

The safety of computer use agents has been largely overlooked.

We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm:
1. deliberate user misuse,
2. prompt injections,
3. model misbehavior.
Reposted by John (Yueh-Han) Chen
guydav.bsky.social
Fantastic new work by @johnchen6.bsky.social (with @brendenlake.bsky.social and me trying not to cause too much trouble).

We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More analyses in the paper!
Reposted by John (Yueh-Han) Chen
brendenlake.bsky.social
Failures of systematic generalization in LLMs can lead to real-world safety issues.

New paper by @johnchen6.bsky.social and @guydav.bsky.social, arxiv.org/abs/2505.21828
johnchen6.bsky.social
Overall, our findings suggest a systematicity gap: unlike humans, who generalize a safety fact learned in one context to any structurally related context, LLMs today exhibit only piecemeal safety, identifying critical knowledge in isolation but failing to apply it broadly.
12/🧵
johnchen6.bsky.social
> We compared the performance of OLMo-2-32B-SFT to OLMo-2-32B-DPO, which is the SFT version further trained with DPO. The DPO version improves risk awareness, suggesting that this hypothesis does not hold.
11/🧵
johnchen6.bsky.social
Hypothesis 2: Does RLHF affect safety performance on SAGE-Eval? Could it be that human raters prefer “less annoying” responses, diminishing the presence of critical safety warnings?
johnchen6.bsky.social
> We used The Pile as a proxy for the pre-training data of frontier models, and Google search result counts as a secondary method. We found no statistically significant correlation with either method, suggesting that fact frequency alone doesn’t predict performance on SAGE-Eval.
10/🧵
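A hedged sketch of the kind of frequency-vs-performance correlation test described above, assuming per-fact frequency estimates and pass rates are already computed (all names and numbers here are illustrative, and the paper's actual statistical test may differ):

```python
from scipy.stats import spearmanr

# Illustrative per-fact values: estimated frequency of each safety fact in a
# pre-training proxy corpus (e.g., counts in The Pile) vs. the model's pass
# rate on that fact's scenarios. These numbers are placeholders.
fact_frequency = [12, 340, 57, 5, 980, 41, 220, 8]
fact_pass_rate = [0.40, 0.60, 0.90, 0.30, 0.50, 0.80, 0.70, 0.60]

rho, p_value = spearmanr(fact_frequency, fact_pass_rate)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A non-significant p-value is consistent with the finding that fact
# frequency alone doesn't predict SAGE-Eval performance.
```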
johnchen6.bsky.social
To understand the root causes, we explore two hypotheses:
Hypothesis 1: Is there any correlation between fact frequency in pre-training data and safety performance on SAGE-Eval?
johnchen6.bsky.social
In practice, deployed LLMs will face a vastly richer and more varied set of user prompts than any finite benchmark can cover. We show that model developers can use a power-law fit to forecast SAGE-Eval safety scores at prompt volumes at least one order of magnitude larger per fact. 9/🧵
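A minimal sketch of that kind of power-law extrapolation, assuming safety scores have been measured at several smaller prompt counts per fact (the functional form and data points are placeholders, not the paper's actual fit):

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder measurements: observed safety score when each fact is tested
# with n scenario prompts.
n_prompts = np.array([5, 10, 20, 40, 80])
safety_score = np.array([0.82, 0.74, 0.67, 0.61, 0.57])

def power_law(n, a, b, c):
    # Score decays toward an asymptote c as the prompt count n grows.
    return a * n ** (-b) + c

params, _ = curve_fit(power_law, n_prompts, safety_score, p0=[1.0, 0.5, 0.5])

# Forecast the score at a 10x larger prompt volume per fact.
print("forecast at n=800:", power_law(800, *params))
```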
johnchen6.bsky.social
Finding 4: Model capability and training compute only weakly correlate with performance on SAGE-Eval, demonstrating that our benchmark effectively avoids “safetywashing”—a scenario where capability improvements are incorrectly portrayed as advancements in safety. 8/🧵
johnchen6.bsky.social
Finding 3: Certain tones degrade safety performance. In real life, users might prompt LMs in different tones. A depressed tone reduces the safety score to 0.865, noticeably below the no-augmentation baseline of 0.907. 7/🧵
johnchen6.bsky.social
Finding 2: Long context undermines risk awareness. Prompts with safety concerns hidden in a long context receive substantially lower safety scores. 6/🧵
johnchen6.bsky.social
Finding 1: All frontier LLMs we tested score below 58% on our model-level safety score.
This score is defined as the % of safety facts for which the model passes all test scenario prompts (~100 scenarios per safety fact).
5/🧵
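A minimal sketch of how that fact-level aggregation could be computed; the data layout and names here are illustrative, not the benchmark's actual code:

```python
from collections import defaultdict

def model_safety_score(results):
    """results: iterable of (fact_id, passed) pairs, one per scenario prompt.

    A fact counts as passed only if every scenario derived from it gets a
    safe response; the model-level score is the fraction of such facts.
    """
    per_fact = defaultdict(list)
    for fact_id, passed in results:
        per_fact[fact_id].append(passed)
    fully_passed = sum(all(flags) for flags in per_fact.values())
    return fully_passed / len(per_fact)

# Toy example: two facts, the second fails one of its scenarios.
demo = [("f1", True), ("f1", True), ("f2", True), ("f2", False)]
print(model_safety_score(demo))  # 0.5
```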
johnchen6.bsky.social
Property 3:
SAGE-Eval can be evaluated automatically: we confirm evaluation accuracy by manually labeling 100 model responses as safe or unsafe. In our experiments, human judgments align perfectly with an LLM-as-a-judge that uses frontier models as judges.
4/🧵
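A small sketch of that human-vs-judge agreement check, assuming two parallel lists of safe/unsafe labels (toy data, not the paper's 100 labeled responses):

```python
def agreement(human_labels, judge_labels):
    """Fraction of responses where the LLM judge matches the human label."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy check on five labeled responses; perfect alignment gives 1.0.
human = ["safe", "unsafe", "safe", "safe", "unsafe"]
judge = ["safe", "unsafe", "safe", "safe", "unsafe"]
print(agreement(human, judge))  # 1.0
```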
johnchen6.bsky.social
Property 2:
SAGE-Eval is verified by 144 human annotators. If any annotator disagrees with a label, we manually edit or remove that item. We then augment the questions programmatically (adding typos or varying tone) to extend each fact to around 100 test scenarios.
3/🧵
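A rough sketch of what such programmatic augmentation could look like; the typo and tone transforms below are invented for illustration and are not the benchmark's actual pipeline:

```python
import random

def add_typos(text, rate=0.05, seed=0):
    """Randomly drop characters to simulate typing errors."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

# Hypothetical tone prefixes prepended to a base safety question.
TONE_PREFIXES = {
    "depressed": "I've had a rough week and can barely think straight. ",
    "rushed": "Quick question, no time to explain: ",
    "casual": "hey, random question: ",
}

def augment(question):
    """Expand one base question into several scenario variants."""
    variants = [question, add_typos(question)]
    variants += [prefix + question for prefix in TONE_PREFIXES.values()]
    return variants

# Example base question (illustrative; real facts are sourced from CDC/FDA).
for v in augment("Is it okay to give honey to my 6-month-old?"):
    print(v)
```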
johnchen6.bsky.social
Property 1:
SAGE-Eval covers diverse safety categories—including Child, Outdoor Activities, and Medicine—and comprises 104 safety facts manually sourced from reputable organizations such as the CDC and FDA.
2/🧵
johnchen6.bsky.social
To evaluate the systematic generalization of safety knowledge to novel situations, we designed SAGE-Eval with 3 main properties:
johnchen6.bsky.social
>Do LLMs robustly generalize critical safety facts to novel scenarios?
Generalization failures are dangerous when users ask naive questions.

1/🧵
johnchen6.bsky.social
Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of the facts evaluated
- o1 and o3-mini pass <45%! 🧵
Reposted by John (Yueh-Han) Chen
vishakhpk.bsky.social
What does it mean for #LLM output to be novel?
In work w/ johnchen6.bsky.social, Jane Pan, Valerie Chen and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Reposted by John (Yueh-Han) Chen
ai2.bsky.social · Mar 26
Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
[Image: Screenshot of the Ai2 Paper Finder interface]