@patqdasilva.bsky.social
Pinned
Steering language models by directly intervening on internal activations is appealing, but does it generalize?
We study 3 popular steering methods across 36 models from 14 families (1.5B to 70B parameters), exposing brittle performance and fundamental flaws in the underlying assumptions.
🧵👇
(1/10)