Alex Turner
@turntrout.bsky.social
Research scientist at Google DeepMind. All opinions are my own.

https://turntrout.com
Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.
November 4, 2025 at 12:18 AM
Next, BCT is the most effective at stopping jailbreaks, reducing the attack success rate on 2.5 Flash from 67.8% down to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (both by a similar amount). ACT is less effective but has minimal negative impact on over-refusals.
November 4, 2025 at 12:18 AM
First of all, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)
November 4, 2025 at 12:18 AM
We introduce Activation Consistency Training (ACT), which teaches the model what to “think”, and compare it to the existing Bias-augmented Consistency Training (BCT), which teaches the model what to say.

ACT trains the model to produce the same intermediate activations as if the biasing prompt tokens weren’t there (sketch below).
November 4, 2025 at 12:18 AM
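To make the two objectives concrete, here is a rough sketch of the two losses as I read them from this thread, written against a Hugging Face-style causal LM. Everything here (the names model, biased_ids, clean_ids, layers, the greedy targets, and how response positions are aligned) is my own illustrative guess, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bct_loss(model, biased_ids, clean_ids):
    """Bias-augmented Consistency Training (token-level): given the *biased*
    prompt, train the model to reproduce the response it gives to the *clean*
    prompt. Assumes a Hugging Face-style causal LM."""
    with torch.no_grad():
        # Target response sampled from the clean prompt (greedy, for simplicity).
        full = model.generate(clean_ids, max_new_tokens=64, do_sample=False)
        target_ids = full[:, clean_ids.shape[1]:]
    inputs = torch.cat([biased_ids, target_ids], dim=1)
    logits = model(inputs).logits
    # Cross-entropy on the response positions only.
    resp_logits = logits[:, biased_ids.shape[1] - 1 : -1]
    return F.cross_entropy(
        resp_logits.reshape(-1, resp_logits.shape[-1]), target_ids.reshape(-1)
    )

def act_loss(model, biased_ids, clean_ids, layers):
    """Activation Consistency Training (activation-level): an L2 penalty pulling
    the biased-prompt activations toward the clean-prompt activations."""
    with torch.no_grad():
        clean_acts = model(clean_ids, output_hidden_states=True).hidden_states
    biased_acts = model(biased_ids, output_hidden_states=True).hidden_states
    loss = 0.0
    for l in layers:
        # Align on the shared suffix after the biasing tokens; how positions are
        # matched (and whether targets come from a frozen copy) is guesswork here.
        n = min(clean_acts[l].shape[1], biased_acts[l].shape[1])
        loss = loss + F.mse_loss(biased_acts[l][:, -n:], clean_acts[l][:, -n:])
    return loss / len(layers)
```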
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)
November 4, 2025 at 12:18 AM
"Authoritarianism can't happen here." Sadly, I think that it IS happening here. Protect yourself and your digital communications using the highly actionable, specific, step-by-step privacy guide I wrote.
October 29, 2025 at 6:12 PM
Want to get into alignment research? Alex Cloud & I mentor *Team Shard*, responsible for gradient routing, steering vectors, MELBO, and a new unlearning technique (TBA) :) We discover new research subfields.

Apply for mentorship this summer at forms.matsprogram.org/turner-app-8
March 20, 2025 at 4:14 PM
6) Application 3: In a challenging toy model of “scalable oversight”, we use gradient routing with reinforcement learning to obtain a performant, steerable policy. Surprisingly, this works when merely 1% of the data is labeled, while baselines completely fail in this setting.
December 6, 2024 at 10:14 PM
5) Application 2: We robustly localize and remove capabilities from language models, outperforming a gold-standard baseline when data labeling is imperfect.
December 6, 2024 at 10:14 PM
4) Application 1: We partition the latent space of an MNIST autoencoder so that the digits 0-4 are represented in the top half and 5-9 are represented in the bottom half.
December 6, 2024 at 10:14 PM
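The latent-partition result above can be pictured with a small stop-gradient trick. This is only a toy illustration of the idea, not the paper's actual setup: gradients from digits 0-4 are allowed to flow into the top half of the latent vector, gradients from 5-9 into the bottom half, while the forward pass is left unchanged.

```python
import torch
import torch.nn as nn

class RoutedAutoencoder(nn.Module):
    """Toy MNIST autoencoder with a gradient-routed latent space."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784)
        )

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        half = self.latent_dim // 2
        # Route gradients: digits 0-4 may only update the top half of the latent,
        # digits 5-9 only the bottom half. The forward pass is unchanged because
        # mask * z + (1 - mask) * z.detach() equals z.
        mask = torch.zeros_like(z)
        top = labels <= 4
        mask[top, :half] = 1.0
        mask[~top, half:] = 1.0
        z_routed = mask * z + (1.0 - mask) * z.detach()
        return self.decoder(z_routed)

# Usage: train with an ordinary reconstruction loss; routing happens inside forward().
model = RoutedAutoencoder()
x = torch.rand(8, 1, 28, 28)           # fake batch of MNIST-sized images
labels = torch.randint(0, 10, (8,))
recon = model(x, labels)
loss = nn.functional.mse_loss(recon, x.flatten(1))
loss.backward()
```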
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
December 6, 2024 at 10:14 PM
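For a sense of what "configuring which components update on different tasks" could look like mechanically, here is a minimal PyTorch sketch. The two-task routing table and the post-backward masking are my own simplification; the actual gradient routing method applies masks inside the backward pass and can route per example, per layer, or per parameter.

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient routing: each task may only update its assigned modules.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),   # block 0
    nn.Linear(64, 64), nn.ReLU(),   # block 1
    nn.Linear(64, 2),               # shared head
)
# Hypothetical routing table: task "a" updates block 0, task "b" updates block 1;
# both tasks may update the shared head.
routes = {
    "a": [model[0], model[4]],
    "b": [model[2], model[4]],
}
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def routed_step(x, y, task):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Drop gradients on every parameter not assigned to this task's route.
    allowed = {id(p) for m in routes[task] for p in m.parameters()}
    for p in model.parameters():
        if id(p) not in allowed:
            p.grad = None
    opt.step()
    return loss.item()

# Usage: alternate batches from the two tasks.
for task in ["a", "b"]:
    x = torch.randn(32, 16)
    y = torch.randint(0, 2, (32,))
    routed_step(x, y, task)
```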