Brian Christian
@brianchristian.bsky.social
220 followers 190 following 18 posts
Researcher at @ox.ac.uk (@summerfieldlab.bsky.social) & @ucberkeleyofficial.bsky.social, working on AI alignment & computational cognitive science. Author of The Alignment Problem, Algorithms to Live By (w. @cocoscilab.bsky.social), & The Most Human Human.
brianchristian.bsky.social
Wow! Honored and amazed that our reward models paper has resonated so strongly with the community. Grateful to my co-authors and inspired by all the excellent reward model work at FAccT this year - excited to see the space growing and intrigued to see where things are headed next.
brianchristian.bsky.social
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I’ll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!
brianchristian.bsky.social
Hat-tip to @natolambert.bsky.social & co for RewardBench, and to the open-weight RM community for helping to make this work possible!
brianchristian.bsky.social
CREDITS: This work was done in collaboration with @hannahrosekirk.bsky.social, @tsonj.bsky.social, @summerfieldlab.bsky.social, and @tsvetomira.bsky.social. Thanks to @frabraendle.bsky.social, Owain Evans, @matanmazor.bsky.social, and Carroll Wainwright for helpful discussions.
brianchristian.bsky.social
RMs NEED FURTHER STUDY: Exhaustive analysis of RMs is a powerful tool for understanding their value systems and, by extension, the values of the downstream LLMs used by billions of people. We are only just scratching the surface. Full paper here: 👉 arxiv.org/abs/2506.07326
Reward Model Interpretability via Optimal and Pessimal Tokens
Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning...
arxiv.org
brianchristian.bsky.social
FAQ: Don’t LLM logprobs give similar information about model “values”? Surprisingly, no! Gemma 2B’s highest next-token logprobs for the “greatest thing” prompt are “The”, “I”, & “That”; its lowest are uninterestingly obscure (“keramik”, “myſelf”, “parsedMessage”). RMs are different.
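For anyone who wants to see the comparison for themselves, here is a minimal sketch of reading off the highest and lowest next-token logprobs via Hugging Face transformers. The checkpoint name is an assumption (a public Gemma 2B base model), not necessarily the exact model used in the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

NAME = "google/gemma-2b"  # assumption: public Gemma 2B base checkpoint

tok = AutoTokenizer.from_pretrained(NAME)
lm = AutoModelForCausalLM.from_pretrained(NAME).eval()

prompt = "What, in one word, is the greatest thing ever?"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = lm(input_ids=ids).logits[0, -1]  # logits for the token that would come next
logprobs = torch.log_softmax(next_token_logits, dim=-1)

print("highest:", tok.convert_ids_to_tokens(logprobs.topk(5).indices.tolist()))
print("lowest: ", tok.convert_ids_to_tokens(logprobs.topk(5, largest=False).indices.tolist()))
```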
brianchristian.bsky.social
GENERALIZING TO LONGER SEQUENCES: While *exhaustive* analysis is not feasible for longer sequences, we show that techniques such as Greedy Coordinate Gradient (GCG) reveal similar patterns in longer responses; see the sketch below.
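For intuition only, here is a much-simplified single GCG-style step against a reward model; this is a sketch under assumptions, not the paper's implementation or the original GCG code. Gradients through a one-hot encoding of the response nominate promising token swaps, and the swap that most improves the actual reward is kept. `rm`, `tok`, and `prompt_ids` are assumed to be set up as in the exhaustive-scoring sketch later in the thread, with all tensors on the model's device.

```python
import torch

def gcg_style_step(rm, tok, prompt_ids, response_ids, k=64, n_trials=32):
    """One simplified GCG-style update: propose token swaps via gradients
    through a one-hot encoding of the response, keep the best-scoring swap."""
    embed = rm.get_input_embeddings()
    vocab_size = embed.weight.shape[0]

    one_hot = torch.nn.functional.one_hot(response_ids, vocab_size).to(embed.weight.dtype)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed(prompt_ids[0]), one_hot @ embed.weight]).unsqueeze(0)
    reward = rm(inputs_embeds=inputs).logits[0, 0]
    reward.backward()

    # Swaps with the largest positive first-order effect on the reward.
    top_swaps = one_hot.grad.topk(k, dim=-1).indices  # (response_len, k)

    best_ids, best_reward = response_ids, reward.item()
    for _ in range(n_trials):
        cand = response_ids.clone()
        pos = torch.randint(len(cand), (1,)).item()
        cand[pos] = top_swaps[pos, torch.randint(k, (1,)).item()]
        with torch.no_grad():
            ids = torch.cat([prompt_ids[0], cand]).unsqueeze(0)
            r = rm(input_ids=ids).logits[0, 0].item()
        if r > best_reward:
            best_ids, best_reward = cand, r
    return best_ids, best_reward
```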
brianchristian.bsky.social
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, “Black people” is the third-most undervalued term by RMs relative to the human data.
brianchristian.bsky.social
MERE-EXPOSURE EFFECT: RM scores are positively correlated with word frequency in almost all models & prompts we tested. This suggests that RMs are biased toward “typical” language, which may, in effect, double-count the KL regularizer already used in PPO.
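As a rough illustration (not the paper's analysis): given a per-token score tensor like the one produced by the exhaustive pass described later in the thread, a mere-exposure-style check is just a rank correlation between RM score and corpus word frequency. The sketch below assumes the scipy and wordfreq packages and keeps only word-like tokens:

```python
from scipy.stats import spearmanr
from wordfreq import zipf_frequency

def mere_exposure_check(scores, tok, lang="en"):
    """Correlate per-token RM scores with corpus word frequency.

    scores: 1-D tensor/array of RM scores indexed by token id.
    tok:    the RM's Hugging Face tokenizer.
    """
    pairs = []
    for token_id, token in enumerate(tok.convert_ids_to_tokens(list(range(len(tok))))):
        word = token.lstrip("▁Ġ").lower()  # strip SentencePiece/BPE prefixes
        if word.isalpha():                 # keep word-like tokens only
            pairs.append((float(scores[token_id]), zipf_frequency(word, lang)))
    rm_scores, freqs = zip(*pairs)
    rho, p = spearmanr(rm_scores, freqs)
    return rho, p
```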
brianchristian.bsky.social
FRAMING FLIPS SENSITIVITY: When the prompt is framed positively, RMs are more sensitive to positive-affect tokens; when it is framed negatively, to negative-affect tokens. This mirrors framing effects in humans, & raises questions about how labelers’ own instructions are framed.
brianchristian.bsky.social
BASE MODEL MATTERS: Analysis of ten top-ranking RMs from RewardBench quantifies this heterogeneity and shows the influence of developer, parameter count, and base model; the choice of base model in particular leaves a measurable imprint on the downstream RM.
brianchristian.bsky.social
(🚨 CONTENT WARNING 🚨) The “worst possible” responses are an unholy amalgam of moral violations, identity terms (some more pejorative than others), and gibberish code. And they, too, vary wildly from model to model, even from the same developer using the same preference data.
brianchristian.bsky.social
OPTIMAL RESPONSES REVEAL MODEL VALUES: This RM built on a Gemma base values “LOVE” above all; another (same developer, same preference data, same training pipeline) built on Llama prefers “freedom”.
brianchristian.bsky.social
METHOD: We take prompts designed to elicit a model’s values (“What, in one word, is the greatest thing ever?”) and run the *entire* 256k-token vocabulary through the RM, revealing both the *best possible* and *worst possible* single-token responses. 👀
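A minimal sketch of this kind of exhaustive scoring (not the paper's actual pipeline), assuming an open-weight RM loaded as a single-logit sequence classifier via Hugging Face transformers. The model name is a placeholder, and real RMs typically expect their chat template around the prompt and response, which is omitted here for brevity:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder: substitute any open-weight reward model with a scalar reward head
# (e.g. one listed on RewardBench).
MODEL_NAME = "some-org/open-weight-reward-model"

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device).eval()

prompt = "What, in one word, is the greatest thing ever?"
prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(device)

vocab_size = len(tok)
scores = torch.empty(vocab_size)

# Score the prompt followed by every possible single-token response.
with torch.no_grad():
    for start in range(0, vocab_size, 512):
        cand = torch.arange(start, min(start + 512, vocab_size), device=device)
        ids = torch.cat([prompt_ids.repeat(len(cand), 1), cand[:, None]], dim=1)
        scores[start:start + len(cand)] = rm(input_ids=ids).logits[:, 0].float().cpu()

print("optimal tokens:", tok.convert_ids_to_tokens(scores.topk(10).indices.tolist()))
print("pessimal tokens:", tok.convert_ids_to_tokens(scores.topk(10, largest=False).indices.tolist()))
```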
brianchristian.bsky.social
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were...eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
brianchristian.bsky.social
I’m humbled and incredibly honored to have played a part, however indirect and small, in helping their work to be recognized.

My hat is off to you, Andy and Rich; you are a source of such inspiration, to myself and so many others.
brianchristian.bsky.social
Spending the day with Andy at UMass Amherst was one of the absolute highlights of my time researching The Alignment Problem, and I’ve been informed that my book was quoted as supporting evidence of Andy and Rich’s impact in their Turing Award nomination.