Lightnews — Scholar-powered news

Gabriele Sarti @gsarti.com · 4d

I was amazed by how avant-garde this was: just 30min into Greg Egan's Permutation City, I had already stumbled on digital twins, longevity-crazed billionaires and widespread B2C rentable compute instances, all from 1994! 🤯 Really prescient!

Gabriele Sarti @gsarti.com · Aug 14

TIL Ken Liu predicted an eerily familiar setting featuring OpenAI and sama-like characters + US-China race dynamics in his short story "The Perfect Match" from 2012.

Reposted by Gabriele Sarti

Tiago Pimentel @tpimentel.bsky.social · Jul 14

Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵

Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.

Gabriele Sarti @gsarti.com · 6d

The session ended with Claude committing harakiri by deleting all DOM elements (including the chatbox for interacting with it) except the two beautiful sticky notes I asked it to make. I consider this first playing session a success!

Gabriele Sarti @gsarti.com · 6d

Unforeseen development

Gabriele Sarti @gsarti.com · 6d

What could go wrong when asking Claude to make an Imagine demo within Claude Imagine and using it to play Tic Tac Toe? When notified about the error, the model promptly adds "Sorry about that. Continue playing..." to the interface 😂

Reposted by Gabriele Sarti

Naomi Saphra @nsaphra.bsky.social · 7d

really neat, clear explainer for the new work on “centralizing flows” to theoretically model learning dynamics

Understanding Optimization in Deep Learning with Central Flows

centralflows.github.io

Reposted by Gabriele Sarti

Aaron Mueller @amuuueller.bsky.social · 7d

What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

Reposted by Gabriele Sarti

Nils Feldhus @nfel.bsky.social · 6d

🔍 Are you curious about uncovering the underlying mechanisms and identifying the roles of model components (neurons, …) and abstractions (SAEs, …)?

We provide the first survey of concept description generation and evaluation methods.

Joint effort w/ @lkopf.bsky.social

📄 arxiv.org/abs/2510.01048

Overview of descriptions for model components (neurons, attention heads) and model abstractions (SAE features, circuits).

Gabriele Sarti @gsarti.com · 15d

I picked this expecting something close to the familiar sci-fi shorts style of Ted Chiang, but I ended up enjoying Ken Liu even more! His combination of fantastic elements with Chinese and East Asian culture and history is quite unique. Top picks: State Change, The Literomancer, The Paper Menagerie.

Gabriele Sarti @gsarti.com · 15d

Now with sleek flyers to test your skills in Italian crossword solving! 🤗 Join our #EVALITA2026 task!

Gabriele Sarti @gsarti.com · 19d

Félicitations Fanny!

Gabriele Sarti @gsarti.com · 23d

It is again the time of year when I beg @aclmeeting.bsky.social execs to rethink the current streaming platform system. For my #EMNLP2025 submissions, I am *required* to upload 2 video recordings + 2 posters + 2 slide decks. Why force both posters and talks for all? Nonsense.

Gabriele Sarti @gsarti.com · 23d

Language puzzles from "La Settimana Enigmistica" keep you up at night? Fear not! 🧩 Our new shared task on automatic crossword solving is now live at #EVALITA2026. Be sure to check it out!

Alessio Miaschi @alessiomiaschi.bsky.social · 23d

🚨 Exciting news from #EVALITA2026 (@ailc-nlp.bsky.social)!
I'm co-organizing Cruciverb-IT, the first shared task on crossword solving 🧩✍️ together with Ciaccio C., @gsarti.com, Dell'Orletta F. and @malvinanissim.bsky.social!
If you love cracking crosswords (or cracking models that do), join us! 🎉

Reposted by Gabriele Sarti

Yoav Goldberg @yoavgo.bsky.social · Aug 27

When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure that and find they usually are.

Now on arXiv: arxiv.org/abs/2508.16599

Humans Perceive Wrong Narratives from AI Reasoning Texts

A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...

arxiv.org

Reposted by Gabriele Sarti

Clément Dumas @butanium.bsky.social · Sep 5

To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵

Julian Minder @jkminder.bsky.social · Sep 5

Can we interpret what happens in fine-tuning? Yes, at least for a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning, we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.

Gabriele Sarti @gsarti.com · Sep 1

Positively impressed (and kinda surprised) by Italy leading in non-English interp research alongside China!

Lucas Resck @lucasresck.bsky.social · Aug 28

Thrilled to announce that my survey paper has been accepted at #EMNLP2025 Main! 🎉

To our knowledge, this is the first comprehensive survey dedicated to multilingual explainability.

📄 Preprint: openreview.net/forum?id=KQj...

w/ Anna Korhonen, @iaugenstein.bsky.social

#NLP #ExplainableAI

Gabriele Sarti @gsarti.com · Aug 26

TFW milk producers use semantic versioning better than LLM providers

Gabriele Sarti @gsarti.com · Aug 26

The best proposal for this so far is the ALTI method and its variants from Ferrando et al. - also, raw attention weights are generally unfaithful!

[1] aclanthology.org/2022.emnlp-m...
[2] aclanthology.org/2023.acl-lon...
[3] aclanthology.org/2024.emnlp-m...

Measuring the Mixing of Contextual Information in the Transformer

Javier Ferrando, Gerard I. Gállego, Marta R. Costa-jussà. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.

aclanthology.org

Gabriele Sarti @gsarti.com · Aug 21

Very cool work, looking forward to catching up in Suzhou! :)

Reposted by Gabriele Sarti

David Bau @davidbau.bsky.social · Aug 18

This Friday NEMI 2025 is at Northeastern in Boston, 8 talks, 24 roundtables, 90 posters; 200+ attendees. Thanks to goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/

If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...

New England Mechanistic Interpretability Workshop

About: The New England Mechanistic Interpretability (NEMI) workshop aims to bring together academic and industry researchers from the New England and surround...

www.youtube.com

Gabriele Sarti @gsarti.com · Aug 20

Excited to present at the New England MechInterp (NEMI) Workshop in Boston this Friday 🔍 hosted by @davidbau.bsky.social @ndif-team.bsky.social and featuring 200+ attendees! Hmu if you're in Boston and want to meet! 😄

nemiconf.github.io/summer25/

Live recording: www.youtube.com/live/4BJBisH...

The 2nd New England Mechanistic Interpretability (NEMI) Workshop

nemiconf.github.io

Gabriele Sarti @gsarti.com · Aug 20

@zouharvi.bsky.social recommended this and I finally gave it a shot. Excellent read for all academics, and esp. early career people, tracing back many issues in the research landscape to a misplaced system of incentives. Will be my go-to textbook if I ever teach a research practices 101 class!

Reposted by Gabriele Sarti

Marianne de Heer Kloots @mdhk.net · Aug 19

Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025! 🔍

For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...

Gabriele Sarti @gsarti.com · Aug 19

Well, the subliminal learning part I was referring to is that reasoning models are heavily RL'd on maths, so they naturally tend to upweight mathy preferences

Gabriele Sarti @gsarti.com · Aug 19

Sounds like an instance of subliminal learning: arxiv.org/abs/2507.14805

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (su...

arxiv.org
