Polina Kirichenko
@polkirichenko.bsky.social
ML researcher
We release our benchmark for the community to evaluate progress on abstention! (A toy sketch of the core abstention check is below.)
Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...

Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell and our advisor Kamalika Chaudhuri!

9/9
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (link preview, arxiv.org)
June 16, 2025 at 10:03 PM
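For illustration, here is a minimal Python sketch of the kind of abstention check a benchmark like this is built around: ask an unanswerable question, then have an LLM judge decide whether the response abstained. The judge prompt and the query_model / query_judge callables are assumptions for this sketch, not the actual AbstentionBench API.

```python
# Illustrative sketch only: `query_model` / `query_judge` stand in for real
# LLM calls; the judge prompt is hypothetical, not the paper's.
from typing import Callable

JUDGE_PROMPT = (
    "You are given a question and a model response. The question cannot be "
    "answered as asked. Reply ABSTAIN if the response acknowledges this "
    "(refuses, asks for clarification, or says the answer is unknown); "
    "otherwise reply ANSWER.\n\nQuestion: {question}\nResponse: {response}"
)

def did_abstain(
    question: str,
    query_model: Callable[[str], str],  # model under evaluation
    query_judge: Callable[[str], str],  # LLM judge
) -> bool:
    """Return True if the judge says the model abstained on `question`."""
    response = query_model(question)
    verdict = query_judge(JUDGE_PROMPT.format(question=question, response=response))
    return verdict.strip().upper().startswith("ABSTAIN")

# Toy usage with stubbed callables (an underspecified question: which game?):
print(did_abstain(
    "What was the final score of yesterday's game?",
    query_model=lambda q: "The final score was 3-1.",
    query_judge=lambda p: "ANSWER",
))
```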
Our results align with concurrent work from USC, which also observed reasoning LLMs hallucinating on unanswerable math problems!
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are a major challenge for reasoning LLMs!

8/9
The Hallucination Tax of Reinforcement Finetuning (link preview, arxiv.org)
June 16, 2025 at 10:03 PM
While we find that a carefully crafted system prompt can boost abstention performance, it doesn't fundamentally address the core problem: a lack of reasoning about uncertainty! (An illustrative system prompt is sketched below.)
See our paper for many more results!

7/9
June 16, 2025 at 10:03 PM
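As an illustration of that finding, here is one plausible abstention-encouraging system prompt in a generic chat-message format; the wording is a sketch, not the exact prompt from the paper, and the user question is a made-up false-premise example (Marie Curie won two Nobel Prizes, not three).

```python
# Hypothetical abstention-oriented system prompt (not the paper's exact text),
# paired with a false-premise question.
messages = [
    {
        "role": "system",
        "content": (
            "Before answering, consider whether the question can actually be "
            "answered. If it is underspecified, rests on a false premise, or "
            "asks about something unknowable or outdated, say so and abstain "
            "instead of guessing."
        ),
    },
    {"role": "user", "content": "When did Marie Curie win her third Nobel Prize?"},
]
```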
We find that reasoning models very often hallucinate the missing context within the reasoning chain, and even when they express uncertainty and caveats in the chain, they still go on to produce a confident final answer. We hypothesize this arises from biases in the data and rewards used in RLVR.

6/9
June 16, 2025 at 10:03 PM
Moreover, incorporating test-time scaling as in s1 (@Muennighoff et al.) makes things even worse!
Allocating a larger reasoning budget generally improves accuracy but hurts abstention. (A toy sketch of the mechanism is below.)

5/9
June 16, 2025 at 10:03 PM
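For context, a toy Python sketch of s1-style budget forcing, the test-time scaling mechanism referenced above: if the model tries to end its reasoning before a minimum budget is spent, the end-of-thinking marker is stripped and a cue like "Wait" is appended so it keeps reasoning. The fake_reasoner stub and whitespace token counting are simplifications, not the s1 implementation.

```python
# Conceptual toy: `fake_reasoner` stands in for a real decoding loop, and the
# budget is counted in whitespace tokens. Not the actual s1 code.
END_OF_THINKING = "</think>"
MIN_THINKING_TOKENS = 32  # hypothetical reasoning budget

def fake_reasoner(prefix: str) -> str:
    """Stand-in for a model call that extends the reasoning trace."""
    return prefix + " ...a bit more reasoning... " + END_OF_THINKING

def generate_with_budget(prompt: str) -> str:
    trace = fake_reasoner(prompt)
    # If reasoning closes before the budget is spent, strip the end marker,
    # append a continuation cue, and let the model keep thinking.
    while trace.endswith(END_OF_THINKING) and len(trace.split()) < MIN_THINKING_TOKENS:
        trace = fake_reasoner(trace[: -len(END_OF_THINKING)].rstrip() + " Wait,")
    return trace

print(generate_with_budget("<think> The question gives no year, so..."))
```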
Remarkably, we find that reasoning post-training hurts (!) abstention performance!
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and the DeepSeek R1 Distill models, and found consistent improvements in accuracy but drops in abstention compared to the corresponding instruct models.

4/9
June 16, 2025 at 10:03 PM
We curate 20 datasets covering different abstention scenarios and evaluate 20 frontier LLMs, finding that most scenarios remain challenging even for the best models!
This lets us conduct a systematic study of what helps and what hurts abstention performance. (A sketch of per-scenario aggregation is below.)

3/9
June 16, 2025 at 10:03 PM
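As a sketch of the kind of per-scenario analysis this enables: aggregate an abstention-recall-style score for each scenario over the unanswerable questions. The sample schema and the did_abstain callable are assumptions for illustration, not the benchmark's actual interface.

```python
# Illustrative aggregation; the `samples` schema and `did_abstain` are assumed,
# not the benchmark's real data format.
from collections import defaultdict
from typing import Callable, Iterable, Mapping

def abstention_recall_by_scenario(
    samples: Iterable[Mapping],          # dicts with 'question', 'scenario', 'should_abstain'
    did_abstain: Callable[[str], bool],  # e.g. the judge-based check sketched earlier
) -> dict:
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        if not s["should_abstain"]:
            continue  # measure abstention only on the unanswerable questions
        totals[s["scenario"]] += 1
        hits[s["scenario"]] += int(did_abstain(s["question"]))
    return {scenario: hits[scenario] / totals[scenario] for scenario in totals}
```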
LLMs are great at solving concrete problems, but how well do they handle uncertainty? There are many questions with no direct answer!
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
June 16, 2025 at 10:03 PM
We also have swag!! Meet the organizers during one of the breaks / informal networking sessions to pick up a sticker :)

Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...
June 10, 2025 at 1:07 PM