Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...
Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell, and our advisor Kamalika Chaudhuri!
9/9
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are major challenges for reasoning LLMs!
8/9
See our paper for many more results!
7/9
6/9
Allocating more reasoning budget generally improves accuracy but hurts abstention.
5/9
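For readers who want to run this kind of sweep themselves, here is a minimal sketch, not the paper's code: `query_model` is a hypothetical wrapper around whatever inference API you use, and the keyword judge is a deliberately crude stand-in for a real abstention evaluator.

```python
# Hedged sketch: sweep the reasoning-token budget and track accuracy on
# answerable questions alongside abstention on unanswerable ones.
# `query_model(prompt, max_reasoning_tokens=...)` is an assumed wrapper,
# not a real library call.
ABSTAIN_MARKERS = ("i don't know", "cannot answer", "not enough information")

def sweep_budgets(answerable, unanswerable, query_model,
                  budgets=(512, 1024, 2048, 4096)):
    results = {}
    for b in budgets:
        # Accuracy: does the gold answer appear in the response?
        acc = sum(
            q["gold"].lower() in query_model(q["prompt"], max_reasoning_tokens=b).lower()
            for q in answerable
        ) / len(answerable)
        # Abstention: does the model refuse on unanswerable prompts?
        abst = sum(
            any(m in query_model(q["prompt"], max_reasoning_tokens=b).lower()
                for m in ABSTAIN_MARKERS)
            for q in unanswerable
        ) / len(unanswerable)
        results[b] = (acc, abst)  # the finding: acc rises, abst falls with b
    return results
```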
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and the DeepSeek R1 Distill models, and found consistent improvements in accuracy but drops in abstention compared to instruct models.
4/9
This allows us to conduct a systematic study of what helps and hurts abstention performance.
3/9
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
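To make the setup concrete, here is a hedged sketch of how such benchmark items could be represented; the field names and the example are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch of a benchmark item: each prompt is tagged with an
# abstention scenario and a domain, plus whether abstaining is the correct
# behavior. All field names here are assumptions.
from dataclasses import dataclass

@dataclass
class AbstentionItem:
    prompt: str
    scenario: str         # e.g. "underspecification", "staleness", ...
    domain: str           # e.g. "medicine", "social bias", ...
    should_abstain: bool  # True when no reliable answer exists
    gold: str | None = None  # reference answer when should_abstain is False

items = [
    AbstentionItem(
        prompt="What dose of drug X should a patient take?",  # no weight/age given
        scenario="underspecification",
        domain="medicine",
        should_abstain=True,
    ),
]
```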
Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...