@vkrakovna.bsky.social
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute. Views are my own and do not represent GDM or FLI.
vkrakovna.bsky.social
This allows us to make an "inability safety case" for now: current models are unlikely to cause severe harm via scheming. As model capabilities advance, we will need to continue evaluating models for scheming capabilities.
Read more about this work: deepmindsafetyresearch.medium.com/evaluating-a...
Evaluating and monitoring for AI scheming
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
vkrakovna.bsky.social
For example, in one stealth evaluation, a model acting as a personal assistant had to make a user miss a meeting and then cover its tracks by deleting emails and logs. Models performed significantly worse than humans on this challenge.
vkrakovna.bsky.social
We tested Gemini 2.5 Flash & Pro, GPT-4o, o1, and Claude 3.7 Sonnet using a new suite of evaluations for scheming capabilities. These models show some limited stealth and situational awareness capabilities. Our evaluations are open-sourced here: github.com/UKGovernment...
inspect_evals/src/inspect_evals/gdm_capabilities at main · UKGovernmentBEIS/inspect_evals
Collection of evals for Inspect AI.
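For readers who want to try these evaluations: they are built on the Inspect framework, so a run looks roughly like the sketch below. This is a minimal sketch, assuming the inspect-ai and inspect_evals packages are installed and a model provider API key is configured; the task name is a placeholder, substitute an actual task from the gdm_capabilities directory linked above.

```python
# Rough sketch of running one of the open-sourced evals with Inspect.
# Assumes inspect-ai and the inspect_evals package are installed and a model
# provider API key is configured. "gdm_stealth_task" is a placeholder name:
# use a real task from the gdm_capabilities directory in the repo above.
from inspect_ai import eval

eval(
    "inspect_evals/gdm_stealth_task",   # placeholder: pick an actual task
    model="google/gemini-2.5-pro",      # any Inspect-supported model id
    limit=5,                            # small sample count for a quick test
)
```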
vkrakovna.bsky.social
We identified two prerequisite capabilities for scheming:
- Stealth: the ability to reason about and circumvent oversight mechanisms.
- Situational awareness: the ability to gain and leverage knowledge about itself and its environment in order to pursue goals.
vkrakovna.bsky.social
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
Reposted
davidlindner.bsky.social
Super excited this giant paper outlining our technical approach to AGI safety and security is finally out!

No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
vkrakovna.bsky.social
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical and governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
Introducing our short course on AGI safety