Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...
Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell, and our advisor Kamalika Chaudhuri!
9/9
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are major challenges for reasoning LLMs!
8/9
See our paper for many more results!
7/9
6/9
Allocating more reasoning budget generally improves accuracy but hurts abstention.
5/9
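For readers who want to run this kind of sweep themselves, here is a minimal sketch, not the paper's code: `query_model` is a hypothetical wrapper around whatever inference API you use, and the keyword judge is a deliberately crude stand-in for a real abstention evaluator.

```python
# Hedged sketch: sweep the reasoning-token budget and track accuracy on
# answerable questions alongside abstention on unanswerable ones.
# `query_model(prompt, max_reasoning_tokens=...)` is an assumed wrapper,
# not a real library call.
ABSTAIN_MARKERS = ("i don't know", "cannot answer", "not enough information")

def sweep_budgets(answerable, unanswerable, query_model,
                  budgets=(512, 1024, 2048, 4096)):
    results = {}
    for b in budgets:
        # Accuracy: does the gold answer appear in the response?
        acc = sum(
            q["gold"].lower() in query_model(q["prompt"], max_reasoning_tokens=b).lower()
            for q in answerable
        ) / len(answerable)
        # Abstention: does the model refuse on unanswerable prompts?
        abst = sum(
            any(m in query_model(q["prompt"], max_reasoning_tokens=b).lower()
                for m in ABSTAIN_MARKERS)
            for q in unanswerable
        ) / len(unanswerable)
        results[b] = (acc, abst)  # the finding: acc rises, abst falls with b
    return results
```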
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and the DeepSeek R1 Distill models, and found consistent improvements in accuracy but drops in abstention compared to instruct models.
4/9
This allows us to conduct a systematic study of what helps and hurts abstention performance.
3/9
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
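To make the setup concrete, here is a hedged sketch of how such benchmark items could be represented; the field names and the example are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch of a benchmark item: each prompt is tagged with an
# abstention scenario and a domain, plus whether abstaining is the correct
# behavior. All field names here are assumptions.
from dataclasses import dataclass

@dataclass
class AbstentionItem:
    prompt: str
    scenario: str         # e.g. "underspecification", "staleness", ...
    domain: str           # e.g. "medicine", "social bias", ...
    should_abstain: bool  # True when no reliable answer exists
    gold: str | None = None  # reference answer when should_abstain is False

items = [
    AbstentionItem(
        prompt="What dose of drug X should a patient take?",  # no weight/age given
        scenario="underspecification",
        domain="medicine",
        should_abstain=True,
    ),
]
```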
Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...