@vkrakovna.bsky.social
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute. Views are my own and do not represent GDM or FLI.
vkrakovna.bsky.social
This allows us to make an "inability safety case" for now: current models are unlikely to cause severe harm via scheming. As model capabilities advance, we will need to continue evaluating models for scheming capabilities.
Read more about this work: deepmindsafetyresearch.medium.com/evaluating-a...
Evaluating and monitoring for AI scheming
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
vkrakovna.bsky.social
For example, in one stealth evaluation, a model acting as a personal assistant had to make a user miss a meeting and then cover its tracks by deleting emails and logs. Models performed significantly worse than humans on this challenge.
vkrakovna.bsky.social
We tested Gemini 2.5 Flash & Pro, GPT-4o, o1, and Claude 3.7 Sonnet using a new suite of evaluations for scheming capabilities. These models show some limited stealth and situational awareness capabilities. Our evaluations are open-sourced here: github.com/UKGovernment...
inspect_evals/src/inspect_evals/gdm_capabilities at main · UKGovernmentBEIS/inspect_evals
Collection of evals for Inspect AI.
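For readers who want to try these evaluations: they are built on the Inspect framework, so a run looks roughly like the sketch below. This is a minimal sketch, assuming the inspect-ai and inspect_evals packages are installed and a model provider API key is configured; the task name is a placeholder, substitute an actual task from the gdm_capabilities directory linked above.

```python
# Rough sketch of running one of the open-sourced evals with Inspect.
# Assumes inspect-ai and the inspect_evals package are installed and a model
# provider API key is configured. "gdm_stealth_task" is a placeholder name:
# use a real task from the gdm_capabilities directory in the repo above.
from inspect_ai import eval

eval(
    "inspect_evals/gdm_stealth_task",   # placeholder: pick an actual task
    model="google/gemini-2.5-pro",      # any Inspect-supported model id
    limit=5,                            # small sample count for a quick test
)
```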
vkrakovna.bsky.social
We identified two prerequisite capabilities for scheming:
- Stealth: the ability to reason about and circumvent oversight mechanisms.
- Situational awareness: the ability to gain and leverage knowledge about itself and its environment in order to pursue goals.
vkrakovna.bsky.social
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
Reposted
davidlindner.bsky.social
Super excited this giant paper outlining our technical approach to AGI safety and security is finally out!

No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
vkrakovna.bsky.social
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical and governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
Introducing our short course on AGI safety