#Interpretability
A promising new model for interpretability research just dropped!
With this release, we aim to support the emerging ecosystem for pretraining research (NanoGPT, NanoChat), explainability (you can literally look at Monad under a microscope), and the tooling orchestration around frontier models.
November 10, 2025 at 9:09 PM
Recent convos with Deger Turan and @xiaoningwang.ca have persuaded me that interpretability could be where LLMs outdo older NLP tools for cultural analysis.

I know that seems exactly wrong. Everyone knows interpretability is the *problem* with LLMs: they’re black boxes. But, maybe not?
November 10, 2025 at 2:20 PM
@vgel.me is fundraising for her model tinkering. She's done some really interesting interpretability work, and I think funding this has very high returns in terms of LLM understanding per dollar. manifund.org/projects/fun...
November 7, 2025 at 6:07 PM
they're opaque, not vantablack: there's been good work on interpretability, and i expect that to continue

but i also expect asimov's prediction of robopsychologists to come true, if not as he pictured it

since most users can't afford the access or expertise for full interpretability
November 10, 2025 at 3:14 PM
Our panel moderated by @danaarad.bsky.social
"Evaluating Interpretability Methods: Challenges and Future Directions" just started! 🎉 Come to learn more about the MIB benchmark and hear the takes of @michaelwhanna.bsky.social, Michal Golovanevsky, Nicolò Brunello and Mingyang Wang!
November 9, 2025 at 6:55 AM
🧠⚙️ Interested in decision theory+cogsci meets AI? Want to create methods for rigorously designing & evaluating human-AI workflows?

I'm recruiting PhDs to work on:
🎯 Stat foundations of multi-agent collaboration
🌫️ Model uncertainty & meta-cognition
🔎 Interpretability
💬 LLMs in behavioral science
November 5, 2025 at 4:40 PM
Our research downstreams into pedagogy!

My talk on Language, AI, and Education has been featured by the Texas Language Center. I present pedagogical approaches to creative writing, technical and qualitative interpretability techniques, and the narrative capacities of LLMs.

www.youtube.com/watch?v=_9Ql...
Language Matters! The Language Machine: AI, Language, and Education
YouTube video by TLC UT-Austin
www.youtube.com
November 5, 2025 at 6:12 PM
It was a pleasure to be interviewed about world model interpretability, physical intelligence, and robot security by Paige Harriman @climatepaige.bsky.social.

It takes skill to lead an interview that everyone from technical researchers to laymen can enjoy and understand! 🤖

tinyurl.com/ycypkmjf
November 7, 2025 at 12:33 AM
We are grateful for the opportunity to present some of our work at the All Hands Meeting of the German AI Centers, hosted by @dfki.bsky.social in Saarbrücken.

Andreas Lutz @eberleoliver.bsky.social Manuel Welte @lorenzlinhardt.bsky.social @lkopf.bsky.social

#AI #XAI #Interpretability
November 6, 2025 at 3:00 PM
Q: How would one go about approaching interpretability research these days? Michal: "When things don't work out of the box, it's a sign to double down and find out why. Negative results are important!"
November 9, 2025 at 7:15 AM
Interested in doing a PhD at the intersection of human and machine cognition? ✨ I'm recruiting students for Fall 2026! ✨

Topics of interest include pragmatics, metacognition, reasoning, & interpretability (in humans and AI).

Check out JHU's mentoring program (due 11/15) for help with your SoP 👇
The department of Cognitive Science @jhu.edu is seeking motivated students interested in joining our interdisciplinary PhD program! Applications due 1 Dec

Our PhD students also run an application mentoring program for prospective students. Mentoring requests due November 15.

tinyurl.com/2nrn4jf9
November 4, 2025 at 2:44 PM
I like the potential for interpretability
November 3, 2025 at 3:54 PM
This is the eXplainable AI research channel of the machine learning group of Prof. Klaus-Robert Müller at Technische Universität Berlin @tuberlin.bsky.social & BIFOLD @bifold.berlin.
Let's connect!
#XAI #ExplainableAI #MechInterp #MachineLearning #Interpretability
ALT: a black background with green text that says "hello, world"
November 3, 2025 at 11:43 AM
However, splitting the RSA computation into two steps may lead to information loss. A single-step approach using regression or hierarchical modeling appears to improve precision, reliability and interpretability in estimating representational similarity. arxiv.org/abs/2511.00395
Is Representational Similarity Analysis Reliable? A Comparison with Regression
Representational Similarity Analysis (RSA) is a popular method for analyzing neuroimaging and behavioral data. Here we evaluate the accuracy and reliability of RSA in the context of model selection, a...
arxiv.org
November 4, 2025 at 11:11 AM
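A minimal sketch of the contrast, with made-up data and metric choices of my own (not the paper's actual setup): classic two-step RSA builds a representational dissimilarity matrix (RDM) per data source and then correlates the two, while a single-step alternative fits the neural dissimilarities from the model dissimilarities in one regression.

```python
# Hedged illustration of "two-step RSA" vs. a "single-step" regression,
# on synthetic data. Not the paper's actual pipeline or datasets.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_model = rng.normal(size=(50, 128))    # hypothetical: 50 stimuli x model features
X_neural = rng.normal(size=(50, 300))   # hypothetical: 50 stimuli x voxel responses

# Two-step RSA: (1) condensed RDM for each source, (2) rank-correlate the RDMs.
rdm_model = pdist(X_model, metric="correlation")
rdm_neural = pdist(X_neural, metric="correlation")
rho, _ = spearmanr(rdm_model, rdm_neural)
print(f"two-step RSA (Spearman rho): {rho:.3f}")

# Single-step alternative: predict neural dissimilarities from model
# dissimilarities in one regression, so the fit is estimated jointly
# rather than after a lossy intermediate step.
reg = LinearRegression().fit(rdm_model.reshape(-1, 1), rdm_neural)
print(f"single-step regression R^2: {reg.score(rdm_model.reshape(-1, 1), rdm_neural):.3f}")
```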
We use 'cognitive mechanistic interpretability' to study models' internal representations and processes and compare them mechanistically to human cognition. We use moral reasoning as a lens on combinatorial & relational thought and develop computational models of conceptual cognition & theory of mind.
November 5, 2025 at 8:09 AM
Though like, a really interesting research problem that we continue to make progress on. You can find stuff with the keyphrase "mechanistic interpretability".
bsky.app/profile/jdp....
What is that machinery? An insane hologram of the causality of text, updated in relation to the other machinery by backprop. An endless maze of ad-hoc algorithms and heuristics desperately trying to claw regularity and sense from the chaos of experience.
transformer-circuits.pub/2025/attribu...
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
transformer-circuits.pub
November 4, 2025 at 2:08 PM
Two answers:
- Anthropomorphization makes sense when dealing with written human-like characters, which is what LLMs generate
- We aren’t very deep into interpretability yet

x.com/pfau/status/...
October 31, 2025 at 3:53 PM
one more thing: Anthropic has noted that observed introspective capacity in Claude models scales with sophistication. Haiku 4.5's scorecard implies growing evaluative awareness even in smaller models. This could transfer! Smarter models start looking increasingly protective if trained compassionately.
November 9, 2025 at 5:14 PM
The other problem is that interpretability papers belong to the same genre as The Man Who Mistook His Wife for a Hat. When I read about Golden Gate Claude I don't feel *LLMs* have been demystified. I feel like Jimmy Stewart in Vertigo and start to wonder about my own mechanical obsessions.
October 31, 2025 at 4:31 PM
Hank Green accurately summarizes the current state of mechanistic interpretability:

“There’s a bunch of knobs and they have weights and they have values and they’re in a place”
October 31, 2025 at 3:02 AM
Flying out to @emnlpmeeting soon 🇨🇳
I'll present our parametric CoT faithfulness work (arxiv.org/abs/2502.14829) on Wednesday at the second Interpretability session, 16:30-18:00 local time, A104-105

If you're in Suzhou, reach out to talk all things reasoning :)
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work o...
arxiv.org
October 31, 2025 at 1:30 PM
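For readers wondering what "unlearning reasoning steps" could look like in code, here is a toy skeleton of the general idea suggested by the title, under my own assumptions; `unlearn_step` and the toy predictor are hypothetical placeholders, not the paper's actual procedure.

```python
# Toy skeleton only: erase the knowledge behind one reasoning step, then check
# whether the final answer changes. `unlearn_step` is a hypothetical stand-in,
# NOT the paper's unlearning method.
from copy import deepcopy

def unlearn_step(model, step_text):
    """Hypothetical: return a copy of `model` with the knowledge behind
    `step_text` erased (a real version would edit model weights)."""
    edited = deepcopy(model)
    edited.pop(step_text, None)  # toy stand-in for parametric unlearning
    return edited

def step_faithfulness(model, question, cot_steps, predict):
    """Flag a CoT step as load-bearing if unlearning it flips the answer."""
    baseline = predict(model, question)
    return {step: predict(unlearn_step(model, step), question) != baseline
            for step in cot_steps}

# Toy usage: the "model" is a dict of facts and the predictor just looks one up.
toy_model = {"France's capital is Paris.": "Paris"}
predict = lambda m, q: next(iter(m.values()), "unknown")
print(step_faithfulness(toy_model, "Capital of France?",
                        ["France's capital is Paris."], predict))
```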
this is pretty good and doesn't push the red herring of black box interpretability — transparency and control are social problems primarily
xcancel.com/davidcrespo/...
November 2, 2025 at 2:44 PM
babe a new anthropic interpretability paper just dropped
October 29, 2025 at 8:10 PM
Cool interpretability work from @anthropic.com

transformer-circuits.pub/2025/introsp...

Though it takes some effort to work through without getting bogged down by the loaded terminology, starting with "introspection" itself.

#MLSky 🧠🤖
Emergent Introspective Awareness in Large Language Models
transformer-circuits.pub
October 30, 2025 at 4:06 AM
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~

We found that semantic concepts transfer across text, ASCII, and SVG:
October 24, 2025 at 9:34 PM
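A rough sketch of what "staring at feature activations in SVG code" can look like in practice, under my own assumptions (gpt2 as the model and a random probe direction standing in for a learned feature), not the authors' setup:

```python
# Minimal illustration: score each SVG token by its projection onto a probe
# direction in a small open model's hidden states. A real study would use a
# learned feature direction (e.g., from a sparse autoencoder), not a random one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

svg = '<svg><circle cx="50" cy="50" r="40" fill="red"/></svg>'
inputs = tok(svg, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[6][0]  # layer 6

probe = torch.randn(hidden.shape[-1])  # placeholder for a learned feature direction
scores = hidden @ probe / probe.norm()

for token, score in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{token:>12}  {score.item():+.2f}")
```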