Amir Zur
@amirzur.bsky.social
PhD @stanfordnlp.bsky.social
amirzur.bsky.social
6/6 Read the full story: owls.baulab.info/

We explore defenses (filtering low-probability tokens helps but isn’t enough) and open questions about multi-token entanglement.
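To make the filtering defense concrete, here is a minimal sketch of one way to drop low-probability tokens before sampling. It is an illustration under stated assumptions, not the implementation from the blog post: the function name `filter_low_probability`, the dict-based distribution, and the `threshold` value of 0.05 are all hypothetical.

```python
def filter_low_probability(probs, threshold=0.05):
    """Drop tokens whose probability is below `threshold`, then renormalize.

    `probs` maps token strings to probabilities. The threshold value is a
    hypothetical choice for illustration, not a value from the blog post.
    """
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    if not kept:
        # If every token falls below the threshold, keep the most likely one
        # so that sampling is still possible.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up next-token distribution in which "087" sits in the low-probability tail.
example = {"412": 0.60, "731": 0.37, "087": 0.03}
filtered = filter_low_probability(example)
print(sorted(filtered))
```

In this toy input the tail token is removed and the remaining mass is rescaled to sum to 1. As the post notes, a filter of this kind helps but is not enough on its own.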

Joint work with Alex Loftus, Hadas Orgad, @zfjoshying.bsky.social,
@keremsahin22.bsky.social, and @davidbau.bsky.social
It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Entangled tokens help explain subliminal learning.
owls.baulab.info
amirzur.bsky.social
5/6 These entangled tokens show up more frequently in subliminal learning datasets, confirming they’re the hidden channel for concept transfer.

This has implications for model safety: concepts could transfer between models in ways we didn’t expect.
amirzur.bsky.social
4/6 The wildest part? You don’t need training at all.

You can just tell Qwen-2.5 “You love the number 023” and ask its favorite animal. It says “cat” with 90% probability (up from 1%).

We call this subliminal prompting: controlling model preferences through entangled tokens alone.
amirzur.bsky.social
3/6 We found the smoking gun: token entanglement. Due to the softmax bottleneck, LLMs can’t give tokens fully independent representations. Some tokens share a subspace in surprising ways.

“owl” and “087” are entangled.
“cat” and “023” are entangled.
And many more…
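
A toy sketch of the idea, assuming a made-up model with a 2-dimensional hidden state and a 4-token vocabulary. With more tokens than hidden dimensions, the output vectors cannot all be orthogonal, so pushing the hidden state toward one token also raises the logits of its neighbors. The vectors in `UNEMBED`, the helper names, and the boost strength are hypothetical; this is not the analysis from the blog post and does not use real model weights.

```python
import math

# Hypothetical 2-D output embeddings for a 4-token vocabulary (made-up numbers).
# "owl"/"087" and "cat"/"023" are deliberately placed close together.
UNEMBED = {
    "owl": (0.9, 0.1),
    "087": (1.0, 0.2),
    "cat": (0.1, 0.9),
    "023": (0.2, 1.0),
}


def next_token_probs(hidden):
    """Softmax over the dot products of the hidden state with each output vector."""
    logits = {
        tok: sum(h * w for h, w in zip(hidden, vec))
        for tok, vec in UNEMBED.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def boost(hidden, token, strength):
    """Move the hidden state along the output vector of `token`."""
    vec = UNEMBED[token]
    return tuple(h + strength * w for h, w in zip(hidden, vec))


baseline = next_token_probs((0.0, 0.0))
boosted = next_token_probs(boost((0.0, 0.0), "087", 3.0))
print(baseline["owl"], boosted["owl"])
```

In this toy setup, boosting "087" raises the probability of "owl" and lowers the probability of "cat", which mirrors "when you boost one, you boost the other" in miniature.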
amirzur.bsky.social
2/6 This phenomenon helps explain the recent “subliminal learning” result from Anthropic: LLMs trained on meaningless number sequences inherit their teacher’s preferences.

A model that likes owls generates numbers, and another model trained on those numbers also likes owls. But why?
amirzur.bsky.social
1/6 🦉 Did you know that telling a language model that it loves the number 087 also makes it love owls?

In our new blog post, It’s Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens that are linked. When you boost one, you boost the other.

owls.baulab.info/