Nora Belrose
@norabelrose.bsky.social
960 followers 15 following 36 posts
AI, philosophy, spirituality. Head of interpretability research at EleutherAI, but posts are my own views, not Eleuther’s.
norabelrose.bsky.social
Strongly agree with this bill https://www.usatoday.com/story/news/politics/2025/09/29/ohio-state-legislator-ban-people-marrying-ai/86427987007/
norabelrose.bsky.social
if the laws of physics are fundamentally probabilistic, as they seem to be, that makes it easier to see how they can smoothly change over time
norabelrose.bsky.social
data attribution is a special case of data causality:

estimating the causal effect of either learning or unlearning one datapoint (or set of datapoints) on the neural network's behavior on other datapoints
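
As a toy illustration of that estimand (not any particular attribution method), the brute-force version on a linear model is just retraining with the datapoint removed; all names here are illustrative:

```python
import numpy as np

def fit(X, y):
    # toy "training": ordinary least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

def unlearning_effect(X, y, i, x_query, y_query):
    # Brute-force causal effect of unlearning datapoint i on the
    # model's squared error at a query point: retrain without it and
    # compare. Practical data-attribution methods (influence functions,
    # TracIn, ...) approximate this quantity without the retrain.
    w_full = fit(X, y)
    keep = np.arange(len(y)) != i
    w_ablated = fit(X[keep], y[keep])
    err = lambda w: float((x_query @ w - y_query) ** 2)
    return err(w_ablated) - err(w_full)
```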
norabelrose.bsky.social
Neural networks don't have organs.

They aren't made of fixed mechanisms.

They have flows of information and intensities of neural activity. They can't be organized into a set of parts with fixed functions.

In the words of Gilles Deleuze, they're bodies without organs (BwO).
norabelrose.bsky.social
This seems like a cool way to use an adaptive amount of compute per token. I speculate that models like these will have more faithful CoT since they don't get to do "extra" reasoning on easy tokens https://arxiv.org/abs/2404.02258
Mixture-of-Depths: Dynamically allocating compute in...
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to...
arxiv.org
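
For context, a heavily simplified sketch of the routing idea, assuming a per-sequence top-k router and a sigmoid gate (the gating and capacity choices here are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    # Mixture-of-Depths-style routing, simplified: a learned scalar
    # router picks the top-k tokens per sequence to send through an
    # expensive block; the rest skip it via the residual stream.
    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block = block            # any (B, T, D) -> (B, T, D) module
        self.capacity = capacity      # fraction of tokens processed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        scores = self.router(x).squeeze(-1)            # (B, T)
        k = max(1, int(self.capacity * t))
        idx = scores.topk(k, dim=1).indices            # (B, k)
        gather_idx = idx.unsqueeze(-1).expand(b, k, d)
        picked = x.gather(1, gather_idx)               # (B, k, D)
        # gate by the router score so routing stays differentiable
        gate = torch.sigmoid(scores.gather(1, idx)).unsqueeze(-1)
        updated = picked + gate * self.block(picked)   # gated residual
        return x.scatter(1, gather_idx, updated)
```

Any shape-preserving module works as `block`, e.g. `MoDRouter(64, nn.TransformerEncoderLayer(64, 4, batch_first=True))`.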
norabelrose.bsky.social
Also chapter 10 where he discards the notion of the Soul but maintains the distinction between mind and brain
norabelrose.bsky.social
William James did a lot of good philosophy of mind in chapters 1, 5, and 6 of The Principles of Psychology; we've barely made any progress in 135 years 😂
norabelrose.bsky.social
might interest @nabla_theta
norabelrose.bsky.social
Pro tip: if you want to implement TopK SAEs efficiently and don't want to deal with Triton, just use this function for the decoder; it's much faster than the naive dense matmul implementation
https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding_bag.html
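
A minimal sketch of the trick, with illustrative names and shapes:

```python
import torch
import torch.nn.functional as F

def topk_decode(top_idx, top_vals, W_dec, b_dec=None):
    # TopK SAE decoder via embedding_bag: for each sample, gather the
    # k active decoder rows and sum them weighted by the latent
    # activations, skipping the (num_latents - k) zero latents entirely.
    # top_idx:  (batch, k) LongTensor of active latent indices
    # top_vals: (batch, k) activations of those latents
    # W_dec:    (num_latents, d_model) decoder weight matrix
    out = F.embedding_bag(top_idx, W_dec, per_sample_weights=top_vals, mode="sum")
    return out if b_dec is None else out + b_dec
```

With `mode="sum"` and `per_sample_weights`, each row of `top_idx` is treated as one bag, so the cost scales with k rather than with the total number of latents.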
norabelrose.bsky.social
Second, we speculate that complexity measures like this may be useful for detecting undesired "extra reasoning" in deep nets. We want networks to be aligned with our values instinctively, without scheming about whether this would be consistent with some ulterior motive. arxiv.org/abs/2311.08379
norabelrose.bsky.social
We're interested in this line of work for two reasons:

First, it sheds light on how deep learning works. The "volume hypothesis" says DL is similar to randomly sampling a network from weight space that gets low training loss. But this can't be tested if we can't measure volume.
norabelrose.bsky.social
We find that the probability of sampling a network at random (or "local volume" for short) decreases exponentially as the network is trained.

And networks which memorize their training data without generalizing have lower local volume (higher complexity) than generalizing ones.
norabelrose.bsky.social
But the total volume can be strongly influenced by a small number of outlier directions, which are hard to sample in high dimension: think of a big, flat pancake.

Importance sampling using gradient info helps address this issue by making us more likely to sample outliers.
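
One plausible shape for such a gradient-tilted proposal (illustrative only; the density-ratio reweighting that makes this a proper importance sampler is omitted here):

```python
import torch

def propose_direction(grad, mix=0.5):
    # Mix an isotropic Gaussian with the normalized gradient so rare
    # outlier directions -- the "pancake" axes -- are proposed more
    # often. A correct estimator must then reweight each ray by the
    # ratio p(u) / q(u) of the isotropic and tilted densities.
    noise = torch.randn_like(grad)
    u = mix * grad / grad.norm() + (1.0 - mix) * noise / noise.norm()
    return u / u.norm()
```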
norabelrose.bsky.social
It works by exploring random directions in weight space, starting from an "anchor" network.

The distance from the anchor to the edge of the region, along the random direction, gives us an estimate of how big (or how probable) the region is as a whole.
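
A toy version of that ray procedure, assuming `anchor` is a flattened weight vector, `loss_fn` scores a perturbed network, and a loss threshold stands in for the behaviorally defined region:

```python
import torch

def ray_size_estimate(anchor, loss_fn, threshold,
                      n_rays=64, max_dist=10.0, n_bisect=20):
    # Shoot random unit directions in weight space from the anchor and
    # bisect for the distance at which the network leaves the low-loss
    # region. The real estimator turns these ray lengths into a
    # log-volume under the prior and importance-samples directions.
    radii = []
    for _ in range(n_rays):
        u = torch.randn_like(anchor)
        u /= u.norm()
        lo, hi = 0.0, max_dist
        for _ in range(n_bisect):      # bisect for the region boundary
            mid = 0.5 * (lo + hi)
            if loss_fn(anchor + mid * u) < threshold:
                lo = mid               # still inside the region
            else:
                hi = mid
        radii.append(lo)
    return sum(radii) / len(radii)     # mean ray length as a crude size proxy
```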
norabelrose.bsky.social
My colleague Adam Scherlis and I developed a method for estimating the probability of sampling a neural network in a behaviorally defined region from a Gaussian or uniform prior.

You can think of this as a measure of complexity: less probable means more complex.
norabelrose.bsky.social
What are the chances you'd get a fully functional language model by randomly guessing the weights?

We crunched the numbers and here's the answer:
norabelrose.bsky.social
we have seven (!) papers lined up for release next week

you know you're on a roll when arxiv throttles you
norabelrose.bsky.social
deepseek now largely replacing chatgpt for me
norabelrose.bsky.social
Evolutionary biology can learn things from machine learning.

Natural selection alone doesn't explain "train-test" or "sim-to-real" generalization, which clearly happens.

At every level of organization, life can zero-shot adapt to novel situations. https://www.youtube.com/watch?v=jJ9O5H2AlWg
norabelrose.bsky.social
Truth is relative when it comes to the physical state of the universe.

But we should accept the existence of perspective-neutral facts about how perspectives relate to one another, to avoid vicious skeptical paradoxes. https://arxiv.org/abs/2410.13819
norabelrose.bsky.social
If OpenAI's new o3 model is "successfully aligned," then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.