Gabriele Sarti
@gsarti.com
1.6K followers 860 following 120 posts
PhD Student at @gronlp.bsky.social 🐮, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc. gsarti.com
Pinned
gsarti.com
I've decided to start a book thread for 2025 to share cool books and stay focused on my reading goals. Here we go! 📚
gsarti.com
I was amazed by how avant-garde this was: only 30min into Greg Egan's Permutation City and I had already stumbled on digital twins, longevity-crazed billionaires and widespread B2C rentable compute instances, all from 1994! 🤯 Really prescient!
gsarti.com
TIL Ken Liu predicted an eerily familiar setting featuring OpenAI and sama-like characters + US-China race dynamics in his short story "The Perfect Match" from 2012.
Reposted by Gabriele Sarti
tpimentel.bsky.social
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
gsarti.com
The session ended with Claude committing harakiri by deleting all DOM elements (including the chatbox for interacting with it) except the two beautiful sticky notes I asked it to make. I consider this first playing session a success!
gsarti.com
Unforeseen development
gsarti.com
What could go wrong when asking Claude to make an Imagine demo within Claude Imagine and using it to play Tic Tac Toe? When notified about the error, the model promptly adds "Sorry about that. Continue playing..." to the interface 😂
Reposted by Gabriele Sarti
nsaphra.bsky.social
really neat, clear explainer for the new work on “centralizing flows” to theoretically model learning dynamics
Understanding Optimization in Deep Learning with Central Flows
centralflows.github.io
Reposted by Gabriele Sarti
amuuueller.bsky.social
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
Reposted by Gabriele Sarti
nfel.bsky.social
🔍 Are you curious about uncovering the underlying mechanisms and identifying the roles of model components (neurons, …) and abstractions (SAEs, …)?

We provide the first survey of concept description generation and evaluation methods.

Joint effort w/ @lkopf.bsky.social

📄 arxiv.org/abs/2510.01048
Overview of descriptions for model components (neurons, attention heads) and model abstractions (SAE features, circuits).
gsarti.com
I picked this up expecting something close to the familiar sci-fi shorts style of Ted Chiang, but I ended up enjoying Ken Liu even more! His combination of fantastic elements with Chinese and East Asian culture and history is quite unique. Top picks: State Change, The Literomancer, The Paper Menagerie.
gsarti.com
Now with sleek flyers to test your skills in Italian crossword solving! 🤗 Join our #EVALITA2026 task!
gsarti.com
Félicitations Fanny!
gsarti.com
It is again the time of year when I beg @aclmeeting.bsky.social execs to rethink the current streaming platform system. For my #EMNLP2025 submissions, I am *required* to upload 2 video recordings + 2 posters + 2 slide decks. Why force both posters and talks for all? Nonsense.
gsarti.com
Language puzzles from "La Settimana Enigmistica" keep you up at night? Fear not! 🧩 Our new shared task on automatic crossword solving is now live at #EVALITA2026. Be sure to check it out!
alessiomiaschi.bsky.social
🚨 Exciting news from #EVALITA2026 (@ailc-nlp.bsky.social)!
I'm co-organizing Cruciverb-IT, the first shared task on crossword solving 🧩✍️ together with Ciaccio C., @gsarti.com, Dell'Orletta F. and @malvinanissim.bsky.social!
If you love cracking crosswords (or cracking models that do), join us! 🎉
Reposted by Gabriele Sarti
yoavgo.bsky.social
When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure that and find they usually are.

Now on arXiv: arxiv.org/abs/2508.16599
Humans Perceive Wrong Narratives from AI Reasoning Texts
A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...
arxiv.org
Reposted by Gabriele Sarti
butanium.bsky.social
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵
jkminder.bsky.social
Can we interpret what happens in fine-tuning? Yes, if it's for a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more
gsarti.com
Positively impressed (and kinda surprised) by Italy leading in non-English interp research alongside China!
lucasresck.bsky.social
Thrilled to announce that my survey paper has been accepted at #EMNLP2025 Main! 🎉

To our knowledge, this is the first comprehensive survey dedicated to multilingual explainability.

📄 Preprint: openreview.net/forum?id=KQj...

w/ Anna Korhonen, @iaugenstein.bsky.social

#NLP #ExplainableAI
gsarti.com
TFW milk producers use semantic versioning better than LLM providers
gsarti.com
Very cool work, looking forward to catching up in Suzhou! :)
Reposted by Gabriele Sarti
gsarti.com
Excited to present at the New England MechInterp (NEMI) Workshop in Boston this Friday 🔍 hosted by @davidbau.bsky.social @ndif-team.bsky.social and featuring 200+ attendees! Hmu if you're in Boston and want to meet! 😄

nemiconf.github.io/summer25/

Live recording: www.youtube.com/live/4BJBisH...
The 2nd New England Mechanistic Interpretability (NEMI) Workshop
nemiconf.github.io
gsarti.com
@zouharvi.bsky.social recommended this and I finally gave it a shot. Excellent read for all academics, and esp. early career people, tracing back many issues in the research landscape to a misplaced system of incentives. Will be my go-to textbook if I ever teach a research practices 101 class!
Reposted by Gabriele Sarti
mdhk.net
Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025! 🔍

For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...
gsarti.com
Well, the subliminal learning part I was referring to is that reasoning models are heavily RL'd on maths, so they naturally tend to upweight mathy preferences