William Jurayj
@williamjurayj.bsky.social
420 followers 220 following 14 posts
PhD student at Johns Hopkins CLSP (@jhuclsp.bsky.social). Researching natural and formal language processing. williamjurayj.com
Pinned
williamjurayj.bsky.social
🚨 You are only evaluating a slice of your test-time scaling model's performance! 🚨

📈 We consider how models’ confidence in their answers changes as test-time compute increases. Reasoning longer helps models answer more confidently!

📝: arxiv.org/abs/2502.13962
Reposted by William Jurayj
jhucompsci.bsky.social
JHU computer scientists including @williamjurayj.bsky.social propose a method that allows #AI models to spend more time thinking through problems & uses a confidence score to determine when the AI should say "I don't know" rather than risking a wrong answer, which is crucial for high-stakes domains.
Teaching AI to admit uncertainty
Johns Hopkins researchers show how different "odds" can teach AI models to admit when they're not confident enough in an answer
hub.jhu.edu
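A minimal sketch of the abstention rule described above, assuming a model call that returns an answer together with a confidence score (the function name and the 0.8 threshold are illustrative, not from the paper):

```python
def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.8) -> str:
    """Return the model's answer only when its confidence clears the
    threshold; otherwise admit uncertainty. Threshold is illustrative."""
    return answer if confidence >= threshold else "I don't know"

print(answer_or_abstain("Paris", confidence=0.95))  # -> Paris
print(answer_or_abstain("Paris", confidence=0.40))  # -> I don't know
```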
Reposted by William Jurayj
mustafasuleymanai.bsky.social
You can't just be right, you have to know you're right. Good advice for LLMs, according to new Johns Hopkins research. Sometimes no answer is better than a wrong one - life or death choices in medicine, for example, or big financial decisions. 🧵
[Image: a 3D graph with compute budget on the X axis, accuracy on the Y axis, and confidence threshold on the Z axis. Accuracy increases with higher compute budgets and confidence thresholds, though the trade-off is fewer questions answered overall.]
williamjurayj.bsky.social
and here I was thinking you were out at the Opera 🤯
williamjurayj.bsky.social
It's been a joy working with @jeff-cheng.bsky.social & Ben Van Durme on this project. And huge thanks to @alexmartin314.bsky.social, @miriamsw.bsky.social, @marcmarone.com, @orionweller.bsky.social, and everyone else who gave very helpful feedback over the past weeks.
williamjurayj.bsky.social
To our knowledge this is the first work to raise this point in the new area of LLM test-time scaling, but the community has been aware of this for a long time. E.g., the Watson effort on Jeopardy, and a push by Jordan Boyd-Graber to reward systems that hold back dubious answers.
williamjurayj.bsky.social
We propose a standard evaluation format, “Jeopardy odds”: win a point when you’re right, lose a point when you’re wrong, and score nothing when you abstain. Under these odds we see compute-scaling distinctions that were hidden in the zero-risk setting. Selection functions matter!
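A sketch of scoring under these odds, assuming per-question (was_correct, confidence) pairs and a confidence-threshold selection function; the data and thresholds below are made up for illustration:

```python
from typing import List, Tuple

def jeopardy_score(results: List[Tuple[bool, float]], threshold: float) -> int:
    """Score under "Jeopardy odds": +1 for a correct answer, -1 for a wrong
    one, and 0 when the model abstains (confidence below the threshold)."""
    score = 0
    for correct, confidence in results:
        if confidence < threshold:
            continue  # abstain: no reward, no penalty
        score += 1 if correct else -1
    return score

# Illustrative per-question results: (was_correct, confidence).
results = [(True, 0.9), (False, 0.6), (True, 0.7), (False, 0.3)]
for t in (0.0, 0.5, 0.8):
    print(f"threshold={t}: score={jeopardy_score(results, t)}")
```

Under the zero-risk setting a wrong answer costs nothing, so answering everything is always optimal and thresholds are indistinguishable; the -1 penalty is what makes the selection function visible.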
williamjurayj.bsky.social
We test DeepSeek-R1 and find that scaling test-time compute can substantially increase a model’s confidence in correct answers, drawing a wider gap between correct and incorrect answers.
williamjurayj.bsky.social
I’d say a key factor is whether a person’s put in a good faith effort to be right for the right reasons. But I’m open to other explanations!
nikhilsksharma.bsky.social
Had a good conversation about "What exactly is misinformation?" with @williamjurayj.bsky.social

Thread below
williamjurayj.bsky.social
In many ways, the Vision Pro hits on both categories.
williamjurayj.bsky.social
At this point, I would probably buy a cellular phone that they made
williamjurayj.bsky.social
I think the 17th-century English were more likely to be enjoying tea than coffee
williamjurayj.bsky.social
Did you recently visit an Apple store?
williamjurayj.bsky.social
I saw this happen live; it was tragic
Reposted by William Jurayj
marcmarone.com
I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!