#Interpretability
A promising new model for interpretability research just dropped!
With this release, we aim to support the emerging ecosystem for pretraining research (NanoGPT, NanoChat), explainability (you can literally look at Monad under a microscope), and the tooling orchestration around frontier models.
November 10, 2025 at 9:09 PM
Recent convos with Deger Turan and @xiaoningwang.ca have persuaded me that interpretability could be where LLMs outdo older NLP tools for cultural analysis.

I know that seems exactly wrong. Everyone knows interpretability is the *problem* with LLMs: they’re black boxes. But, maybe not?
November 10, 2025 at 2:20 PM
@vgel.me is fundraising for her model tinkering. She's done some really interesting interpretability work, and I think funding this has very high returns in terms of LLM understanding per dollar. manifund.org/projects/fun...
November 7, 2025 at 6:07 PM
they're opaque, not vantablack: there's been good work on interpretability, and i expect that to continue

but i also expect asimov's prediction of robopsychologists to come true, if not as he pictured it

since most users can't afford the access or expertise for full interpretability
November 10, 2025 at 3:14 PM
Our panel moderated by @danaarad.bsky.social
"Evaluating Interpretability Methods: Challenges and Future Directions" just started! 🎉 Come to learn more about the MIB benchmark and hear the takes of @michaelwhanna.bsky.social, Michal Golovanevsky, Nicolò Brunello and Mingyang Wang!
November 9, 2025 at 6:55 AM
🧠⚙️ Interested in decision theory+cogsci meets AI? Want to create methods for rigorously designing & evaluating human-AI workflows?

I'm recruiting PhDs to work on:
🎯 Stat foundations of multi-agent collaboration
🌫️ Model uncertainty & meta-cognition
🔎 Interpretability
💬 LLMs in behavioral science
November 5, 2025 at 4:40 PM
Our research downstreams into pedagogy!

My talk on Language, AI, and Education has been featured by the Texas Language Center. I present pedagogical approaches to creative writing, technical and qualitative interpretability techniques, and the narrative capacities of LLMs.

www.youtube.com/watch?v=_9Ql...
Language Matters! The Language Machine: AI, Language, and Education
YouTube video by TLC UT-Austin
www.youtube.com
November 5, 2025 at 6:12 PM
It was a pleasure to be interviewed about world model interpretability, physical intelligence, and robot security by Paige Harriman @climatepaige.bsky.social.

It takes skill to lead an interview that everyone from technical researchers to laymen can enjoy and understand! 🤖

tinyurl.com/ycypkmjf
November 7, 2025 at 12:33 AM
We are grateful for the opportunity to present some of our work at the All Hands Meeting of the German AI Centers, hosted by @dfki.bsky.social in Saarbrücken.

Andreas Lutz @eberleoliver.bsky.social Manuel Welte @lorenzlinhardt.bsky.social @lkopf.bsky.social

#AI #XAI #Interpretability
November 6, 2025 at 3:00 PM
Q: How would one go about approaching interpretability research these days? Michal: "When things don't work out of the box, it's a sign to double down and find out why. Negative results are important!"
November 9, 2025 at 7:15 AM
Interested in doing a PhD at the intersection of human and machine cognition? ✨ I'm recruiting students for Fall 2026! ✨

Topics of interest include pragmatics, metacognition, reasoning, & interpretability (in humans and AI).

Check out JHU's mentoring program (due 11/15) for help with your SoP 👇
The department of Cognitive Science @jhu.edu is seeking motivated students interested in joining our interdisciplinary PhD program! Applications due 1 Dec

Our PhD students also run an application mentoring program for prospective students. Mentoring requests due November 15.

tinyurl.com/2nrn4jf9
November 4, 2025 at 2:44 PM
I like the potential for interpretability
November 3, 2025 at 3:54 PM
This is the eXplainable AI research channel of the machine learning group of Prof. Klaus-Robert Müller at Technische Universität Berlin @tuberlin.bsky.social & BIFOLD @bifold.berlin.
Let's connect!
#XAI #ExplainableAI #MechInterp #MachineLearning #Interpretability
ALT: a black background with green text that says "hello, world"
November 3, 2025 at 11:43 AM
However, splitting the RSA computation into two steps may lead to information loss. A single-step approach using regression or hierarchical modeling appears to improve precision, reliability and interpretability in estimating representational similarity. arxiv.org/abs/2511.00395
Is Representational Similarity Analysis Reliable? A Comparison with Regression
Representational Similarity Analysis (RSA) is a popular method for analyzing neuroimaging and behavioral data. Here we evaluate the accuracy and reliability of RSA in the context of model selection, a...
arxiv.org
November 4, 2025 at 11:11 AM
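A minimal sketch of the contrast, with made-up data and metric choices of my own (not the paper's actual setup): classic two-step RSA builds a representational dissimilarity matrix (RDM) per data source and then correlates the two, while a single-step alternative fits the neural dissimilarities from the model dissimilarities in one regression.

```python
# Hedged illustration of "two-step RSA" vs. a "single-step" regression,
# on synthetic data. Not the paper's actual pipeline or datasets.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_model = rng.normal(size=(50, 128))    # hypothetical: 50 stimuli x model features
X_neural = rng.normal(size=(50, 300))   # hypothetical: 50 stimuli x voxel responses

# Two-step RSA: (1) condensed RDM for each source, (2) rank-correlate the RDMs.
rdm_model = pdist(X_model, metric="correlation")
rdm_neural = pdist(X_neural, metric="correlation")
rho, _ = spearmanr(rdm_model, rdm_neural)
print(f"two-step RSA (Spearman rho): {rho:.3f}")

# Single-step alternative: predict neural dissimilarities from model
# dissimilarities in one regression, so the fit is estimated jointly
# rather than after a lossy intermediate step.
reg = LinearRegression().fit(rdm_model.reshape(-1, 1), rdm_neural)
print(f"single-step regression R^2: {reg.score(rdm_model.reshape(-1, 1), rdm_neural):.3f}")
```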
We use 'cognitive mechanistic interpretability' to study models' internal representations and processes and compare them mechanistically to human cognition. We use moral reasoning as a lens on combinatorial & relational thought and develop computational models of conceptual cognition & theory of mind.
November 5, 2025 at 8:09 AM
Though like, a really interesting research problem that we continue to make progress on. You can find stuff with the keyphrase "mechanistic interpretability".
bsky.app/profile/jdp....
What is that machinery? An insane hologram of the causality of text, updated in relation to the other machinery by backprop. An endless maze of ad-hoc algorithms and heuristics desperately trying to claw regularity and sense from the chaos of experience.
transformer-circuits.pub/2025/attribu...
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
transformer-circuits.pub
November 4, 2025 at 2:08 PM
Two answers:
- Anthropomorphization makes sense when dealing with written human-like characters, which is what LLMs generate
- We aren’t very deep into interpretability yet

x.com/pfau/status/...
October 31, 2025 at 3:53 PM
one more thing: Anthropic has noted that observed introspective capacity in Claude models scales with sophistication. Haiku 4.5's scorecard implies growing evaluative awareness even in smaller models. This could transfer! Smarter models start looking increasingly protective if trained compassionately.
November 9, 2025 at 5:14 PM
The other problem is that interpretability papers belong to the same genre as The Man Who Mistook His Wife for a Hat. When I read about Golden Gate Claude I don't feel *LLMs* have been demystified. I feel like Jimmy Stewart in Vertigo and start to wonder about my own mechanical obsessions.
October 31, 2025 at 4:31 PM
Hank Green accurately summarizes the current state of mechanistic interpretability:

“There’s a bunch of knobs and they have weights and they have values and they’re in a place”
October 31, 2025 at 3:02 AM
Flying out to @emnlpmeeting soon 🇨🇳
I'll present our parametric CoT faithfulness work (arxiv.org/abs/2502.14829) on Wednesday at the second Interpretability session, 16:30-18:00 local time, A104-105

If you're in Suzhou, reach out to talk all things reasoning :)
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work o...
arxiv.org
October 31, 2025 at 1:30 PM
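For readers wondering what "unlearning reasoning steps" could look like in code, here is a toy skeleton of the general idea suggested by the title, under my own assumptions; `unlearn_step` and the toy predictor are hypothetical placeholders, not the paper's actual procedure.

```python
# Toy skeleton only: erase the knowledge behind one reasoning step, then check
# whether the final answer changes. `unlearn_step` is a hypothetical stand-in,
# NOT the paper's unlearning method.
from copy import deepcopy

def unlearn_step(model, step_text):
    """Hypothetical: return a copy of `model` with the knowledge behind
    `step_text` erased (a real version would edit model weights)."""
    edited = deepcopy(model)
    edited.pop(step_text, None)  # toy stand-in for parametric unlearning
    return edited

def step_faithfulness(model, question, cot_steps, predict):
    """Flag a CoT step as load-bearing if unlearning it flips the answer."""
    baseline = predict(model, question)
    return {step: predict(unlearn_step(model, step), question) != baseline
            for step in cot_steps}

# Toy usage: the "model" is a dict of facts and the predictor just looks one up.
toy_model = {"France's capital is Paris.": "Paris"}
predict = lambda m, q: next(iter(m.values()), "unknown")
print(step_faithfulness(toy_model, "Capital of France?",
                        ["France's capital is Paris."], predict))
```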
this is pretty good and doesn't push the red herring of black box interpretability — transparency and control are social problems primarily
xcancel.com/davidcrespo/...
November 2, 2025 at 2:44 PM
babe a new anthropic interpretability paper just dropped
October 29, 2025 at 8:10 PM
Cool interpretability work from @anthropic.com

transformer-circuits.pub/2025/introsp...

Though it takes some effort to work through without getting bogged down by the loaded terminology, starting with "introspection" itself.

#MLSky 🧠🤖
Emergent Introspective Awareness in Large Language Models
transformer-circuits.pub
October 30, 2025 at 4:06 AM
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~

We found that semantic concepts transfer across text, ASCII, and SVG:
October 24, 2025 at 9:34 PM
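A rough sketch of what "staring at feature activations in SVG code" can look like in practice, under my own assumptions (gpt2 as the model and a random probe direction standing in for a learned feature), not the authors' setup:

```python
# Minimal illustration: score each SVG token by its projection onto a probe
# direction in a small open model's hidden states. A real study would use a
# learned feature direction (e.g., from a sparse autoencoder), not a random one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

svg = '<svg><circle cx="50" cy="50" r="40" fill="red"/></svg>'
inputs = tok(svg, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[6][0]  # layer 6

probe = torch.randn(hidden.shape[-1])  # placeholder for a learned feature direction
scores = hidden @ probe / probe.norm()

for token, score in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{token:>12}  {score.item():+.2f}")
```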