Johannes Gasteiger🔸
@gasteigerjo.bsky.social
Safe & beneficial AI. Working on Alignment Science at Anthropic. Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.
gasteigerjo.bsky.social
My AI Safety Paper Highlights for September '25:

- *Deliberative anti-scheming training*
- Shutdown resistance
- Hierarchical sabotage monitoring
- Interpretability-based audits

More at open.substack.com/pub/aisafety...
gasteigerjo.bsky.social
My AI Safety Paper Highlights for August '25:
- *Pretraining data filtering*
- Misalignment from reward hacking
- Evading CoT monitors
- CoT faithfulness on complex tasks
- Safe-completions training
- Probes against ciphers

More at open.substack.com/pub/aisafety...
gasteigerjo.bsky.social
Paper Highlights, July '25:

- *Subliminal learning*
- Monitoring CoT-as-computation
- Verbalizing reward hacking
- Persona vectors
- The circuits research landscape
- Minimax regret against misgeneralization
- Large red-teaming competition
- gpt-oss evals

open.substack.com/pub/aisafety...
gasteigerjo.bsky.social
AI Safety Paper Highlights, June '25:
- *The Emergent Misalignment Persona*
- Investigating alignment faking
- Models blackmailing users
- Sabotage benchmark suite
- Measuring steganography capabilities
- Learning to evade probes

open.substack.com/pub/aisafety...
Reposted by Johannes Gasteiger🔸
far.ai
FAR.AI @far.ai · May 20
“If an automated researcher were malicious, what could it try to achieve?”

@gasteigerjo.bsky.social discusses how AI models can subtly sabotage research, noting that while current models still struggle to do this on complex tasks, the capability warrants vigilant monitoring.
gasteigerjo.bsky.social
AI Safety Paper Highlights, May '25:
- *Evaluation awareness and evaluation faking*
- AI value trade-offs
- Misalignment propensity
- Reward hacking
- CoT monitoring
- Training against lie detectors
- Exploring the landscape of refusals

open.substack.com/pub/aisafety...
gasteigerjo.bsky.social
Paper Highlights, April '25:

- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas

aisafetyfrontier.substack.com/p/paper-high...
gasteigerjo.bsky.social
These results suggest that simple defenses are insufficient against sabotage in internal AI R&D if we cannot otherwise detect hidden goals. Further research is needed on this threat model and on stronger defenses.
gasteigerjo.bsky.social
In preliminary experiments on research decision steering, we find that sandbagging the justification for an ML solution easily causes a model-based judge to disprefer that solution. At the same time, prompted monitors largely fail to detect this manipulation.
gasteigerjo.bsky.social
Prompted monitors struggle to reliably detect this kind of sabotage.
gasteigerjo.bsky.social
Models are getting better at sabotage: Claude 3.5 Sonnet can't reduce task performance as subtly as Claude 3.7 Sonnet can. However, Claude 3.7 starts from a higher baseline task performance, so it also has more headroom to reduce its performance without raising suspicion.
gasteigerjo.bsky.social
When given a one-shot example of effective sandbagging, sabotage performance improves further, achieving lower suspiciousness and lower task performance.
gasteigerjo.bsky.social
Claude 3.7 Sonnet can significantly sandbag its task performance without arousing suspicion. It reduces performance to the level of Claude 3.5 Sonnet (new): research sabotaged in this way would effectively be stuck with the previous-generation model.
gasteigerjo.bsky.social
We studied two forms of research sabotage:

1) Experiment sandbagging: deliberately “acting dumb” on safety-critical ML research;
2) Research decision steering: manipulating arguments to favor certain ML solutions.

Both could potentially slow down safety-critical AI research.
gasteigerjo.bsky.social
New Anthropic blog post: Subtle sabotage in automated researchers.

As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.