Alex Turner
@turntrout.bsky.social
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)
November 4, 2025 at 12:18 AM
"Authoritarianism can't happen here." Sadly, I think that it IS happening here. Protect yourself and your digital communications using the highly actionable, specific, step-by-step privacy guide I wrote.
October 29, 2025 at 6:12 PM
"Authoritarianism can't happen here." Sadly, I think that it IS happening here. Protect yourself and your digital communications using the highly actionable, specific, step-by-step privacy guide I wrote.
Want to get into alignment research? Alex Cloud & I mentor *Team Shard*, responsible for gradient routing, steering vectors, MELBO, and a new unlearning technique (TBA) :) We discover new research subfields.
Apply for mentorship this summer at forms.matsprogram.org/turner-app-8
March 20, 2025 at 4:14 PM
This book is really fun & informative. I have a solid understanding of a bunch of my body's processes now. And I can just start reading random physiology Wikipedia pages and roughly follow. :)
My review with insights and my remaining confusions: turntrout.com/insights-fro...
Insights From “The Manga Guide to Physiology”
This book breaks down complex physiology into digestible parts, using charming visuals & clear explanations. You might be surprised how much you can learn!
turntrout.com
January 24, 2025 at 6:33 AM
Mark Kurzeja & I exploited weaknesses in the multiple-choice TruthfulQA dataset while hiding the questions! A few simple rules of thumb achieved 79% accuracy.
Even well-regarded benchmarks can have flaws. Kudos to the authors for addressing this!
Read at turntrout.com/original-tru...
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
Common factuality benchmark was easily gamed using our simple decision tree. The benchmark is now updated.
turntrout.com
January 16, 2025 at 2:10 AM
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
December 6, 2024 at 10:14 PM