Lightnews — Scholar-powered news

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

780 followers 58 following 20 posts

PhD student at UC Berkeley studying RL and AI safety.
https://cassidylaidlaw.com


Cassidy Laidlaw

@cassidylaidlaw.bsky.social

Our new RL algorithm, AssistanceZero, trains an assistant that displays emergent helpful behaviors like *active learning* and *learning from corrections*.

April 11, 2025 at 10:17 PM

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

In Minecraft, we use an assistance game formulation where a simulated human is given random houses to build, and an AI assistant learns via RL to help the human out. The assistant can't see the goal house, so it has to predict the goal and maintain uncertainty to be helpful.

April 11, 2025 at 10:17 PM
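
To make "predict the goal and maintain uncertainty" concrete, here is a minimal sketch of Bayesian goal inference. It is not the AssistanceZero algorithm and is not taken from the paper. The goal names, the block coordinates, the `eps` noise parameter, and the simple human model (blocks belonging to the goal are likely, all others are rare) are assumptions made purely for illustration.

```python
import math

def update_belief(belief, goals, placed_block, eps=0.05):
    """Bayes update of a belief over candidate goals after one observed block.

    Hypothetical human model (an assumption): the human places a block that
    belongs to their goal with probability 1 - eps, and any other block with
    probability eps.
    """
    posterior = {}
    for name, prior in belief.items():
        likelihood = (1.0 - eps) if placed_block in goals[name] else eps
        posterior[name] = prior * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def entropy(belief):
    """Shannon entropy (in nats) of the belief; higher means more uncertain."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

# Two made-up candidate houses, each a set of (x, y) block positions.
goals = {
    "cabin": {(0, 0), (1, 0), (0, 1)},
    "tower": {(0, 0), (0, 1), (0, 2)},
}
belief = {"cabin": 0.5, "tower": 0.5}

# A block shared by both goals carries no information, so the belief is unchanged.
belief = update_belief(belief, goals, (0, 0))
# A block that only the tower contains shifts most of the mass to "tower".
belief = update_belief(belief, goals, (0, 2))
print(belief, entropy(belief))
```

While the belief is spread across several goals, an assistant in this toy setup would do best to restrict itself to actions that help under every plausible goal, such as placing blocks the candidates share, and commit to one house only once the entropy drops.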

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

April 11, 2025 at 10:17 PM

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

December 19, 2024 at 5:17 PM
