It is important for tech employees to make it clear that we will not accept making AI for authoritarianism.
davidbau.github.io/poetsandnurses
On GitHub. PRs welcome.
github.com/davidbau/poetsandnurses
That the central challenge in AI is how to amplify human agency. This is not easy.
Do you think AI will ever be superhuman at taking responsibility for what should be?
Read more:
davidbau.com/archives/20...
x.com/dhadfieldme...
Of course AI can do anything a human brain can do, Noam argues. Including making wise decisions.
x.com/polynoamial...
About the question I see as central in AI ethics, interpretability, and safety. Can an AI take responsibility? I do not think so, but *not* because it's not smart enough.
davidbau.com/archives/20...
I have been writing up some thoughts on what the research says about effective action, and what universities specifically can do.
davidbau.github.io/poetsandnurs...
It's on GitHub. Suggestions and pull requests welcome.
github.com/davidbau/poe...
Is copying all there is?
@ericwtodd.bsky.social trained models on group-operation tasks where tokens have no fixed meaning, and found a basket of mechanisms beyond copying.
Watch them emerge, a grokking cascade! ↓
bsky.app/profile/eri...
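Roughly the flavor of the setup, in a toy sketch. The specifics here (cyclic-group composition with per-sequence relabeling of tokens) are my own illustrative guess, not Eric's exact task:

import random

def group_sequence(n=5, vocab=50, n_examples=8, rng=None):
    # assign fresh, arbitrary token ids to the n group elements for THIS sequence only,
    # so no token carries a fixed meaning across the training set
    rng = rng or random.Random()
    labels = rng.sample(range(vocab), n)
    seq = []
    for _ in range(n_examples):
        a, b = rng.randrange(n), rng.randrange(n)
        seq += [labels[a], labels[b], labels[(a + b) % n]]  # a, b, then a∘b in the cyclic group
    return seq

print(group_sequence())

The relabeling is what makes "no fixed meaning" concrete: the model cannot memorize what any token means, so it has to work out the relationships in context.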
I can finally read my great-grandfather's epitaph. Try it:
davidbau.com/archives/202...
What superhuman AGIs say when the boss is not around:
davidbau.com/archives/202...
Watch Claude Code grow my 780 lines to 13,600 - mandelbrot.page/coverage/ca...
Two fundamental rules for staying in control:
davidbau.com/archives/20...
Here is a blog post summarizing the talk:
davidbau.com/archives/202...
And possible applications of the filter mechanism, such as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
If we pick up the representation for a question in French, it will accurately match items expressed in the Thai language.
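In code terms, the matching step is just a similarity lookup in representation space. A minimal sketch; where exactly the question representation is picked up (layer, position) is the part the real analysis pins down, and is left as a placeholder here:

import torch

def best_match(question_rep, item_reps):
    # question_rep: the vector picked up for the French question
    # item_reps: [n_items, d] vectors for the items expressed in Thai
    sims = torch.nn.functional.cosine_similarity(item_reps, question_rep.unsqueeze(0), dim=-1)
    return sims.argmax().item()   # index of the Thai item the French question selects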
OK, here it is fixed. The nice thing about the workbench is that it takes just a second to edit the prompt, and you can see how the LLM responds, now deciding very early that it should be ':'
Instead it first "thinks" about the (English) word "love".
In other words: LLMs translate using *concepts*, not tokens.
The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
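If you want the grid without the workbench, a bare-bones logit lens is a few lines of PyTorch. GPT-2 and the prompt here are just for illustration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The French word for love is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# decode every layer's hidden state through the final norm and the unembedding:
# rows of the grid are layers, columns are token positions
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))   # [1, seq, vocab]
    best = logits.argmax(-1)[0]
    print(layer, [tok.decode(int(t)) for t in best])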
That's easy! (you might think) Because surely it knows: amore, amor, amour are all based on the same Latin word. It can just drop the "e", or add a "u".
We need to be aware of when an LM is thinking about tokens and when it is thinking about concepts.
It does both, and it makes a difference which way it's thinking.
@keremsahin22.bsky.social + Sheridan are finding cool ways to look into Olah's induction hypothesis too!
Sheridan discovered (NeurIPS MechInterp 2025) that semantic vector arithmetic works better in this space. (Token semantics work in token space.)
arithmetic.baulab.info/
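Here is the shape of the experiment in toy form: king - man + woman done on middle-layer hidden states instead of token embeddings. The layer and the candidate list are arbitrary choices of mine; the careful version is at arithmetic.baulab.info:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def rep(word, layer=6):
    # hidden state of the word's last token at a middle layer (layer choice is illustrative)
    with torch.no_grad():
        hs = model(**tok(word, return_tensors="pt"), output_hidden_states=True).hidden_states
    return hs[layer][0, -1]

v = rep("king") - rep("man") + rep("woman")
candidates = ["queen", "prince", "table", "woman"]
sims = {w: torch.cosine_similarity(v, rep(w), dim=0).item() for w in candidates}
print(max(sims, key=sims.get))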
That happens even for computer code. They copy the BEHAVIOR of the code, but write it in a totally different way!
If the target context is in Chinese, they will copy the concept into Chinese. Or you can patch them between runs to get Italian. They mediate translation.
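Patching between runs is a small amount of code once you know which hidden state to grab. A sketch; the layer, position, and prompts here are illustrative stand-ins, not the states the analysis actually identifies:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, POS = 6, -1   # where the concept is assumed to sit: illustrative guesses

def capture(prompt):
    saved = {}
    def hook(mod, inp, out):
        saved["h"] = out[0][:, POS, :].clone()   # grab the hidden state at (LAYER, POS)
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return saved["h"]

def patched_next_token(prompt, h):
    def hook(mod, inp, out):
        out[0][:, POS, :] = h                    # overwrite the same slot with the captured concept
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()
    return tok.decode(int(logits[0, -1].argmax()))

# capture the concept from one run, then drop it into a run with a different target language
h = capture("The word for love in French is")
print(patched_next_token("La parola italiana per questo concetto è", h))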
Instead of copying tokens, they copy *concepts*.
Yes, the token induction of Elhage and Olsson is there.
But there is *another* route where the copying is done in a different way. It shows up in attention heads that do 2-ahead copying.
bsky.app/profile/sfe...
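One way to see what "2-ahead" means: on a repeated random sequence, score each head by how much attention it puts one token past the earlier match (the classic induction offset) versus two tokens past it. The model and threshold here are just for illustration:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

T = 50
seq = torch.randint(1000, 10000, (1, T))
ids = torch.cat([seq, seq], dim=1)          # a random sequence repeated once
with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions   # per layer: [1, heads, 2T, 2T]

rows = torch.arange(T + 2, 2 * T)           # query positions inside the repeat
for layer, a in enumerate(attn):
    for head in range(a.shape[1]):
        one_ahead = a[0, head, rows, rows - T + 1].mean().item()  # classic induction offset
        two_ahead = a[0, head, rows, rows - T + 2].mean().item()  # the 2-ahead offset
        if two_ahead > 0.3:                  # prints candidate 2-ahead heads, if any
            print(layer, head, round(one_ahead, 2), round(two_ahead, 2))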
Induction heads are how transformers copy text: they find earlier tokens in identical contexts. (Elhage 2021, Olsson 2022 arxiv.org/abs/2209.11895)
But when that context, "what token came before," is erased, how could induction possibly work?
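The mechanism in words, restated as a few lines of plain Python (a description of the algorithm an induction head implements, not the model's actual code):

def induction_guess(tokens):
    # find the most recent earlier occurrence of the current token,
    # then predict the token that followed it there
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_guess(["the", "cat", "sat", "on", "the"]))   # -> "cat"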