Cassidy Laidlaw
cassidylaidlaw.bsky.social
PhD student at UC Berkeley studying RL and AI safety.
https://cassidylaidlaw.com
Our new RL algorithm, AssistanceZero, trains an assistant that displays emergent helpful behaviors like *active learning* and *learning from corrections*.
April 11, 2025 at 10:17 PM
In Minecraft, we use an assistance game formulation where a simulated human is given random houses to build, and an AI assistant learns via RL to help the human out. The assistant can't see the goal house, so it has to predict the goal and maintain uncertainty to be helpful.
April 11, 2025 at 10:17 PM
We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
April 11, 2025 at 10:17 PM
When RLHFed models engage in “reward hacking,” it can lead to unsafe or unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
December 19, 2024 at 5:17 PM