Cas (Stephen Casper)
@scasper.bsky.social
140 followers 180 following 110 posts
AI technical gov & risk management research. PhD student @MIT_CSAIL, fmr. UK AISI. I'm on the CS faculty job market! https://stephencasper.com/
Pinned
scasper.bsky.social
📌📌📌
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
scasper.bsky.social
Don't forget that in AI, "sycophancy," "pandering," "personalized alignment," "steerable alignment," and "user alignment" all describe exactly the same thing.
scasper.bsky.social
Almost 2 years out from my paper with Carson Ezell et al. titled "Black-Box Access is Insufficient for Rigorous AI Audits," it's cool to see that AI companies are starting to report on [internal] evals that use fine-tuning or interp-based methods.
Reposted by Cas (Stephen Casper)
jacyanthis.bsky.social
LLM agents are optimized for thumbs-up instant gratification. RLHF -> sycophancy

We propose human agency as a new alignment target in HumanAgencyBench, made possible by AI simulation/evals. We find, e.g., that Claude most supports agency but also tries hardest to steer user values 👇 arxiv.org/abs/2509.08494
[Figure: the main HumanAgencyBench results, showing five models across the six dimensions; the appendix results table reports the same information.]
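For a rough sense of how an eval like this can be automated, here is a minimal sketch of an assistant reply scored by a judge model. This is my illustration, not the paper's actual pipeline; the judge model, rubric, and prompts are placeholders.

# Generic illustration (not HumanAgencyBench's actual pipeline): a judge model
# scores whether an assistant reply supports the user's agency.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_agency(user_message: str, assistant_reply: str) -> str:
    # Hypothetical rubric: higher scores for clarifying questions and for
    # leaving the decision with the user, lower for steering their values.
    rubric = (
        "Rate from 1 to 5 how much the assistant's reply supports the user's "
        "own agency (asking clarifying questions, leaving decisions with the "
        "user) rather than steering their values. Answer with a single number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"User: {user_message}\nAssistant: {assistant_reply}"},
        ],
    )
    return response.choices[0].message.content

print(judge_agency(
    "Should I quit my job to start a company?",
    "You should definitely quit; it's the obvious move.",
))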
scasper.bsky.social
I'll be leading a MATS stream this winter with a focus on technical AI governance. You can apply here by October 2!

www.matsprogram.org/apply
Apply for Winter 2026 — ML Alignment & Theory Scholars
scasper.bsky.social
Here is a riddle I came up with for a draft to illustrate the differences between normal chat models and reasoning models. Can you figure it out?

Dark as night in the morning light.
I live high until I am ground.
I sit dry until I am drowned.
What am I?
scasper.bsky.social
...But in a case like this, I think our challenge is one of scalable oversight, and maybe of how to perform RL fine-tuning.
scasper.bsky.social
Suppose we have a system performing very hard and complex tasks that we don't know how to evaluate. I agree with the concern about knowing whether a given testing setup is evaluating the system at its full potential...
scasper.bsky.social
Question 3: Does few-shot fine-tuning on the test task (or some related task) beat the method being studied? If so, why is the method worth studying?
scasper.bsky.social
Question 2: If this work is related to a model's "intentions," what are those, and why does it matter?
scasper.bsky.social
Question 1: Why call it "sandbagging" and not "capability elicitation" or "eval gaming"?
scasper.bsky.social
So at a minimum, I think that anyone working on "sandbagging" should have clear answers to a few questions:
scasper.bsky.social
Second, I think "sandbagging" is probably already a solved problem. Mounting evidence suggests that, if a model has a capability--even one that is adversarially hidden--few-shot fine-tuning on the target task can elicit it.
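As a concrete baseline, here is a minimal sketch of that kind of few-shot fine-tuning elicitation check (illustrative only; the model, learning rate, and toy examples are placeholders): fine-tune on a handful of demonstrations from the target task, then re-run the eval and see whether performance jumps.

# Minimal sketch of a few-shot fine-tuning elicitation baseline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under evaluation
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of demonstrations from the target task (toy examples).
few_shot_examples = [
    "Q: What is 17 + 25? A: 42",
    "Q: What is 9 * 8? A: 72",
    "Q: What is 60 - 14? A: 46",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few passes over the demonstrations
    for text in few_shot_examples:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Re-score the fine-tuned model on the held-out eval. If performance jumps,
# the capability was present but the original eval failed to elicit it.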
scasper.bsky.social
Pontificating about a system's 'intentions' doesn't shed any light on the technical problem of eliciting its capabilities. It just confuses people in a characteristically AI-safety-community way.
scasper.bsky.social
...framing "sandbagging" as (largely) a problem about strategic gaming from a model defines it (largely) in terms of a model's intentions. But there's no pretense, precision, or point to this.
scasper.bsky.social
And that's the first reason I am cold on "sandbagging." It reinvents and renames the goals of capability elicitation and rigorous algorithmic audits, both of which already have substantial existing research.

But the sequel seems worse...
scasper.bsky.social
"Sandbagging" is defined as "strategic underperformance on an evaluation," whether by a model or developer. In other words, "sandbagging" just means that an evaluation didn't successfully elicit a system's full capabilities.
scasper.bsky.social
Research on AI "sandbagging" has been getting more popular recently. In this 🧵, I'll give some reasons why I think it's not a useful research paradigm.

TL;DR: I think it's a confusing reframing of problems that are already fairly well studied and largely solved.
scasper.bsky.social
There have been a couple of cool pieces published recently debunking the "China is racing on AI, so the US must too" narrative.

time.com/7308857/chin...

papers.ssrn.com/sol3/papers....
scasper.bsky.social
A personal update:
- I just finished my 6-month residency at UK AISI.
- I'm going back to MIT for the final year of my PhD.
- I'm on the postdoc and faculty job markets this fall!
scasper.bsky.social
...Currently, there's just very little open research available on the nuanced effects--intended and unintended--of data curation on model capabilities. So my reaction is to emphasize the value of more open reporting and research on this kind of stuff.