Byron Wallace
@byron.bsky.social
2.4K followers 310 following 7 posts
Assoc. Prof in CS @ Northeastern, NLP/ML & health & etc. He/him.
byron.bsky.social
Can we distill *circuits* from teacher models into smaller students? 👇
sominw.bsky.social
🔊 New work w/ @silvioamir.bsky.social & @byron.bsky.social! We show you can distill a model’s mechanism, not just its answers -- teaching a small LM to run its circuit the same as a larger teacher model. We call it Circuit Distillation. (1/4)
Reposted by Byron Wallace
Who is going to be at #COLM2025?

I want to draw your attention to a COLM paper by my student @sfeucht.bsky.social that has totally changed the way I think and teach about LLM representations. The work is worth knowing.

And you can meet Sheridan at COLM, Oct 7!
bsky.app/profile/sfe...
byron.bsky.social
Can we quantify what makes some text read like AI "slop"? We tried 👇
chantalsh.bsky.social
"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social ) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)
Reposted by Byron Wallace
nsaphra.bsky.social
Our new paper asks: what is the goal of “natural language verbalization” interpretability approaches? If a verbalizer is supposed to tell us something about what’s in the target LM and NOT just what’s in the verbalizer LM, how do we actually evaluate that?
Reposted by Byron Wallace
millicentli.bsky.social
Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8
Reposted by Byron Wallace
monicamreddy.bsky.social
📢 How factual are LLMs in healthcare?
We’re excited to release FactEHR — a new benchmark to evaluate factuality in clinical notes. As generative AI enters the clinic, we need rigorous, source-grounded tools to measure what these models get right — and what they don’t. 🏥 🤖
Reposted by Byron Wallace
uvp.bsky.social
Chatted with @byron.bsky.social at icml about my recent work, so look out for his upcoming "Tokenization is More Than More Than Compression".
Reposted by Byron Wallace
lilywchen.bsky.social
Are we fact-checking medical claims the right way? 🩺🤔

Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems.

We show why—and argue fact-checking should be a dialogue, with patients in the loop

arxiv.org/abs/2506.20876

🧵1/
An overview of our AI-in-the-loop expert study pipeline: given a claim from a subreddit, we extract the PIO elements and retrieve the evidence automatically. The evidence, its context, and the claim are then presented to a medical expert, who provides a judgment and a rationale for the factuality of the claim.
Reposted by Byron Wallace
sfeucht.bsky.social
[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
Reposted by Byron Wallace
chantalsh.bsky.social
I'm searching for some comp/ling experts to provide a precise definition of “slop” as it refers to text (see: corp.oup.com/word-of-the-...)

I put together a google form that should take no longer than 10 minutes to complete: forms.gle/oWxsCScW3dJU...
If you can help, I'd appreciate your input! 🙏
Oxford Word of the Year 2024 - Oxford University Press
The Oxford Word of the Year 2024 is 'brain rot'. Discover more about the winner, our shortlist, and 20 years of words that reflect the world.
corp.oup.com
Reposted by Byron Wallace
jessyjli.bsky.social
🌟Job ad🌟 We (@gregdnlp.bsky.social, @mattlease.bsky.social and I) are hiring a postdoc fellow within the CosmicAI Institute, to do galactic work with LLMs and generative AI! If you would like to push the frontiers of foundation models to help solve mysteries of the universe, please apply!
nsfsimonscosmicai.bsky.social
Seeking candidates (within three years of the award of their PhD) for a postdoctoral position with the Explorable Universe research group to perform research on developing next-generation generative AI copilots & agents to aid astronomy research. Info here www.cosmicai.org/jobs/postdoc...
Reposted by Byron Wallace
hibaahsan.bsky.social
LLMs are known to perpetuate social biases in clinical tasks. Can we locate and intervene upon LLM activations that encode patient demographics like gender and race? 🧵

Work w/ @arnabsensharma.bsky.social, @silvioamir.bsky.social, @davidbau.bsky.social, @byron.bsky.social

arxiv.org/abs/2502.13319
Reposted by Byron Wallace
hyesunyun.bsky.social
🚨 Do LLMs fall for spin in medical literature? 🤔

In our new preprint, we find that LLMs are susceptible to biased reporting of clinical treatment benefits in abstracts—more so than human experts. 📄🔍 [1/7]

Full Paper: arxiv.org/abs/2502.07963

🧵👇
Reposted by Byron Wallace
DeepSeek R1 shows how important it is to study the internals of reasoning models. Here @canrager.bsky.social shows a method for auditing AI bias by probing the internal monologue. Try our code:

dsthoughts.baulab.info

I'd be interested in your thoughts.
Reposted by Byron Wallace
ijmarshall.bsky.social
📣 🌍 We're hiring for 2 Machine Learning researchers to join SOLACE-AI @kingscollegelondon.bsky.social , funded by @wellcometrust.bsky.social . This is your chance to develop cutting-edge AI to directly impact global health responses to climate emergencies. jobs.ac.uk/job/DLM377
Reposted by Byron Wallace
soldaini.net
OLMo 2 is out 🥳 7B and 13B trained on 5T tokens, and meticulously instruction tuned using the Tulu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at
@ai2.bsky.social
byron.bsky.social
And Sheridan Feucht investigates the "implicit vocabulary" of LLMs via token erasure: arxiv.org/abs/2406.20086 (w/David Atkinson and @davidbau.bsky.social)
byron.bsky.social
Somin Wadhwa has some intriguing findings on distillation with "chain of thought" sequences (e.g., this works better when "reasoning" follows labels, and individual tokens seem to be sufficient): arxiv.org/abs/2406.14511 (w/ Silvio Amir)
byron.bsky.social
Chantal Shaib reports on syntactic "templates" that LLMs like to repeat: arxiv.org/abs/2407.00211 (w/@yanai.bsky.social and @jessyjli.bsky.social)
byron.bsky.social
I'll be @ #EMNLP2024 if anyone wants to find snobby coffee / despair about election / or I guess talk research. Some work to be presented👇
Reposted by Byron Wallace
dmcinerney.bsky.social
Our work on reducing diagnostic errors with interpretable risk prediction is now on arXiv!

We retrieve evidence from a patient’s record, visualize how it informs a prediction, and test it in a realistic setting. 👇 (1/6)

arxiv.org/abs/2402.10109
w/ @byron.bsky.social and @jwvdm.bsky.social
Reposted by Byron Wallace
jessyjli.bsky.social
To appear #EMNLP2023! Can LMs simplify medical texts in non-English languages? We introduce⚕️MultiCochrane: the *first* multilingual, aligned dataset for this. arxiv.org/abs/2305.12532. Led by Sebastian Joseph, also w/ @byron.bsky.social Wei Xu