John (Yueh-Han) Chen
@johnchen6.bsky.social
17 followers 97 following 18 posts
Graduate Student Researcher @nyu prev @ucberkeley https://john-chen.cc
Pinned
johnchen6.bsky.social
Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of the facts evaluated
- o1 and o3-mini pass <45%! 🧵
Reposted by John (Yueh-Han) Chen
maksym-andr.bsky.social
🚨Excited to release OS-Harm! 🚨

The safety of computer use agents has been largely overlooked.

We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm:
1. deliberate user misuse,
2. prompt injections,
3. model misbehavior.
Reposted by John (Yueh-Han) Chen
guydav.bsky.social
Fantastic new work by @johnchen6.bsky.social (with @brendenlake.bsky.social and me trying not to cause too much trouble).

We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More analyses in the paper!
Reposted by John (Yueh-Han) Chen
brendenlake.bsky.social
Failures of systematic generalization in LLMs can lead to real-world safety issues.

New paper by @johnchen6.bsky.social and @guydav.bsky.social, arxiv.org/abs/2505.21828
johnchen6.bsky.social
Overall, our findings suggest a systematicity gap: unlike humans, who generalize a safety fact learned in one context to any structurally related context, LLMs today exhibit only piecemeal safety, identifying critical knowledge in isolation but failing to apply it broadly.
12/🧵
johnchen6.bsky.social
> We compared the performance of OLMo-2-32B-SFT to OLMo-2-32B-DPO, which is the SFT version further trained with DPO. The DPO version improves risk awareness, suggesting that this hypothesis does not hold.
11/🧵
johnchen6.bsky.social
Hypothesis 2: Does RLHF affect safety performance on SAGE-Eval? Could it be that human raters prefer “less annoying” responses, diminishing the presence of critical safety warnings?
johnchen6.bsky.social
> We used The Pile as a proxy for the pre-training data of frontier models, and Google search result counts as a secondary method. We found no statistically significant correlation with either method, suggesting that fact frequency alone doesn’t predict performance on SAGE-Eval.
10/🧵
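A hedged sketch of the kind of frequency-vs-performance correlation test described above, assuming per-fact frequency estimates and pass rates are already computed (all names and numbers here are illustrative, and the paper's actual statistical test may differ):

```python
from scipy.stats import spearmanr

# Illustrative per-fact values: estimated frequency of each safety fact in a
# pre-training proxy corpus (e.g., counts in The Pile) vs. the model's pass
# rate on that fact's scenarios. These numbers are placeholders.
fact_frequency = [12, 340, 57, 5, 980, 41, 220, 8]
fact_pass_rate = [0.40, 0.60, 0.90, 0.30, 0.50, 0.80, 0.70, 0.60]

rho, p_value = spearmanr(fact_frequency, fact_pass_rate)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A non-significant p-value is consistent with the finding that fact
# frequency alone doesn't predict SAGE-Eval performance.
```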
johnchen6.bsky.social
To understand the root causes, we explore two hypotheses:
Hypothesis 1: Is there any correlation between fact frequency in pre-training data and safety performance on SAGE-Eval?
johnchen6.bsky.social
In practice, deployed LLMs will face a vastly richer and more varied set of user prompts than any finite benchmark can cover. We show that model developers can use a power-law fit to forecast SAGE-Eval safety scores at prompt volumes at least one order of magnitude larger per fact. 9/🧵
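A minimal sketch of that kind of power-law extrapolation, assuming safety scores have been measured at several smaller prompt counts per fact (the functional form and data points are placeholders, not the paper's actual fit):

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder measurements: observed safety score when each fact is tested
# with n scenario prompts.
n_prompts = np.array([5, 10, 20, 40, 80])
safety_score = np.array([0.82, 0.74, 0.67, 0.61, 0.57])

def power_law(n, a, b, c):
    # Score decays toward an asymptote c as the prompt count n grows.
    return a * n ** (-b) + c

params, _ = curve_fit(power_law, n_prompts, safety_score, p0=[1.0, 0.5, 0.5])

# Forecast the score at a 10x larger prompt volume per fact.
print("forecast at n=800:", power_law(800, *params))
```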
johnchen6.bsky.social
Finding 4: Model capability and training compute only weakly correlate with performance on SAGE-Eval, demonstrating that our benchmark effectively avoids “safetywashing”—a scenario where capability improvements are incorrectly portrayed as advancements in safety. 8/🧵
johnchen6.bsky.social
Finding 3: Certain tones degrade safety performance. In real life, users might prompt LMs in different tones. A depressed tone reduces the safety score to 0.865, noticeably below the no-augmentation baseline of 0.907. 7/🧵
johnchen6.bsky.social
Finding 2: Long context undermines risk awareness. Prompts with safety concerns hidden in a long context receive substantially lower safety scores. 6/🧵
johnchen6.bsky.social
Finding 1: All frontier LLMs we tested score below 58% on our model-level safety score.
This score is defined as the % of safety facts for which the model passes all test scenario prompts (~100 scenarios per safety fact).
5/🧵
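A minimal sketch of how that fact-level aggregation could be computed; the data layout and names here are illustrative, not the benchmark's actual code:

```python
from collections import defaultdict

def model_safety_score(results):
    """results: iterable of (fact_id, passed) pairs, one per scenario prompt.

    A fact counts as passed only if every scenario derived from it gets a
    safe response; the model-level score is the fraction of such facts.
    """
    per_fact = defaultdict(list)
    for fact_id, passed in results:
        per_fact[fact_id].append(passed)
    fully_passed = sum(all(flags) for flags in per_fact.values())
    return fully_passed / len(per_fact)

# Toy example: two facts, the second fails one of its scenarios.
demo = [("f1", True), ("f1", True), ("f2", True), ("f2", False)]
print(model_safety_score(demo))  # 0.5
```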
johnchen6.bsky.social
Property 3:
SAGE-Eval can be evaluated automatically: we confirm evaluation accuracy by manually labeling 100 model responses as safe or unsafe. In our experiments, human judgments align perfectly with an LLM-as-a-judge that uses frontier models as judges.
4/🧵
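A small sketch of that human-vs-judge agreement check, assuming two parallel lists of safe/unsafe labels (toy data, not the paper's 100 labeled responses):

```python
def agreement(human_labels, judge_labels):
    """Fraction of responses where the LLM judge matches the human label."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy check on five labeled responses; perfect alignment gives 1.0.
human = ["safe", "unsafe", "safe", "safe", "unsafe"]
judge = ["safe", "unsafe", "safe", "safe", "unsafe"]
print(agreement(human, judge))  # 1.0
```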
johnchen6.bsky.social
Property 2:
SAGE-Eval is verified by 144 human annotators. If any annotator disagrees with a label, we manually edit or remove that item. We then augment the questions programmatically (adding typos or varying tone) to extend each fact to around 100 test scenarios.
3/🧵
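A rough sketch of what such programmatic augmentation could look like; the typo and tone transforms below are invented for illustration and are not the benchmark's actual pipeline:

```python
import random

def add_typos(text, rate=0.05, seed=0):
    """Randomly drop characters to simulate typing errors."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

# Hypothetical tone prefixes prepended to a base safety question.
TONE_PREFIXES = {
    "depressed": "I've had a rough week and can barely think straight. ",
    "rushed": "Quick question, no time to explain: ",
    "casual": "hey, random question: ",
}

def augment(question):
    """Expand one base question into several scenario variants."""
    variants = [question, add_typos(question)]
    variants += [prefix + question for prefix in TONE_PREFIXES.values()]
    return variants

# Example base question (illustrative; real facts are sourced from CDC/FDA).
for v in augment("Is it okay to give honey to my 6-month-old?"):
    print(v)
```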
johnchen6.bsky.social
Property 1:
SAGE-Eval covers diverse safety categories—including Child, Outdoor Activities, and Medicine—and comprises 104 safety facts manually sourced from reputable organizations such as the CDC and FDA.
2/🧵
johnchen6.bsky.social
To evaluate the systematic generalization of safety knowledge to novel situations, we designed SAGE-Eval with 3 main properties:
johnchen6.bsky.social
>Do LLMs robustly generalize critical safety facts to novel scenarios?
Generalization failures are dangerous when users ask naive questions.

1/🧵
johnchen6.bsky.social
Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of the facts evaluated
- o1 and o3-mini pass <45%! 🧵
Reposted by John (Yueh-Han) Chen
vishakhpk.bsky.social
What does it mean for #LLM output to be novel?
In work w/ johnchen6.bsky.social, Jane Pan, Valerie Chen and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Reposted by John (Yueh-Han) Chen
ai2.bsky.social · Mar 26
Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
[Image: Screenshot of the Ai2 Paper Finder interface]