Leshem (Legend) Choshen @ICML @ACL
@lchoshen.bsky.social
3K followers 770 following 860 posts
🥇 LLMs together (co-created model merging, BabyLM, textArena.ai) 🥈 Spreading science over hype in #ML & #NLP Proud shareLM💬 Donor @IBMResearch & @MIT_CSAIL
Reposted by Leshem (Legend) Choshen @ICML @ACL
interplay-workshop.bsky.social
✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social John Hewitt @amuuueller.bsky.social @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion.

Join us in Montréal! @colmweb.org
Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C.

09:00 am: Opening
09:10 am: Invited Talks by Sarah Wiegreffe and John Hewitt
10:20 am: Paper Presentations

Lunch Break

01:00 pm: Invited Talks by Aaron Mueller and Kyle Mahowald
02:10 pm: Poster Session
03:20 pm: Roundtable Discussion
04:50 pm: Closing
lchoshen.bsky.social
Indeed, look at how it is encoded: you decode it in just the same way. (I believe they also had a lossy video or audio version, but I didn't look into the details.)
lchoshen.bsky.social
One thing I couldn't find is speed. Compression ratio is usually a speed vs. performance trade-off. This seems like a massively slow process, so I wonder (even with SLMs) when it is justified, and how well other methods would do given so much more compute.
lchoshen.bsky.social
They did it for images, video, and text, and it all compresses really, really well.
lchoshen.bsky.social
So, on average, we get short numbers to represent sentences. And to decode them, we run the model again, get its probabilities, and use those to decide which next word to feed back to the model.
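Roughly, that decoding loop looks like this; a toy sketch with an invented stand-in distribution (the same toy distribution reappears in the encoding sketch further down this thread), not the paper's code:

```python
def toy_next_token_probs(context):
    # Stand-in for an LLM's next-token distribution (numbers invented for illustration).
    if not context:
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("it", 0.45)]
    if context == ["I"]:
        return [("am", 0.40), ("was", 0.35), ("think", 0.25)]
    return [("<eos>", 1.0)]

def decode(code, n_tokens):
    low, high, out = 0.0, 1.0, []
    for _ in range(n_tokens):                      # a real coder stops at an end-of-text token
        cum = 0.0
        for tok, p in toy_next_token_probs(out):
            width = high - low
            t_low, t_high = low + cum * width, low + (cum + p) * width
            if t_low <= code < t_high:             # the number tells us which token was encoded
                out.append(tok)
                low, high = t_low, t_high
                break
            cum += p
    return out

print(decode(0.51, 2))   # ['I', 'am']: 0.51 lies inside the "I am" sub-interval 0.50-0.52
```

(A real implementation works with integer arithmetic so precision never runs out.)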
lchoshen.bsky.social
Then the next token's probabilities split 0.5-0.55 further, perhaps mapping every sentence starting "I am..." to 0.512-0.5124.
The probabilities come from your favorite LLM.
The result is a single number for your sentence. And if the model was good, this number will be short.
lchoshen.bsky.social
Paper: alphaxiv.org/pdf/2407.07723
Arithmetic coding works by sequentially cutting up the number space between 0 and 1.
Consider compressing "I am".
The model says the probability of starting a sentence with the token "a" is 30%, "the" 20%, "I" 5%, ...
So any sentence "I" + ... falls in 0.50-0.55 (see the toy sketch below).
Lossless data compression by large models | alphaXiv
View 2 comments: What about speed? What uses do people imagine this can really have? (maybe something with cold storage?)
alphaxiv.org
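Here is a toy sketch of the encoding side, with an invented stand-in distribution chosen to match the numbers above (not the paper's implementation):

```python
def toy_next_token_probs(context):
    # Stand-in for an LLM's next-token distribution (numbers invented for illustration).
    if not context:
        return [("a", 0.30), ("the", 0.20), ("I", 0.05), ("it", 0.45)]
    if context == ["I"]:
        return [("am", 0.40), ("was", 0.35), ("think", 0.25)]
    return [("<eos>", 1.0)]

def encode(tokens):
    # Narrow the interval [0, 1) once per token; any number inside the final interval encodes the text.
    low, high, context = 0.0, 1.0, []
    for tok in tokens:
        cum = 0.0
        for cand, p in toy_next_token_probs(context):
            if cand == tok:
                width = high - low
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
        context.append(tok)
    return low, high

print(encode(["I"]))        # roughly (0.50, 0.55): "I" gets the 0.50-0.55 slice of [0, 1)
print(encode(["I", "am"]))  # roughly (0.50, 0.52): "am" narrows it further
```

The better the model predicts the text, the wider the final interval, and the fewer bits are needed to name a number inside it.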
lchoshen.bsky.social
LLMs, VLMs, ... can compress data
3x over JPEG/PNG etc.
6x over Zlib, gzip etc.
How?
We all know they provide a probability over data, which is all classical compression needs
(arithmetic coding, see below)
Understanding is compressing, but this time not by the weights themselves
🤖📈🧠
#AI #compress #data
lchoshen.bsky.social
They also make it hard to improve and get feedback
Reposted by Leshem (Legend) Choshen @ICML @ACL
tpimentel.bsky.social
LLMs are trained to mimic a "true" distribution; their decreasing cross-entropy confirms they get closer to this target while training. Do similar models approach this target distribution in similar ways, though? 🤔 Not really! Our new paper studies this, finding 4 convergence phases in training 🧵
Figure showing the four phases of convergence in LM training
lchoshen.bsky.social
I wish I had this guy's chats to research and compare his claims to alternative uses (e.g., through shareLM).
lchoshen.bsky.social
The paper's authors are also tagged in this thread so maybe they know more
lchoshen.bsky.social
One paper also finds that cross-linguality is hard across scripts (replication is always good bsky.app/profile/lcho... ),
and that models tend to become more cross-lingual with training.
lchoshen.bsky.social
Thus, a "feature" is defined by the sparse activations we find.
And these are shifting quite rapidly at a certain part in training
lchoshen.bsky.social
How can we do it?
Crosscoders map activations into a sparse representation and decode it back into the activations (classic compress-decompress).
A single crosscoder is then trained on the activations of all pretraining checkpoints, creating a shared space.
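Roughly, in code: a toy sketch of that recipe, where the dimensions, the per-checkpoint encoder/decoder layout, and the plain L1 penalty are my assumptions rather than the papers' implementation:

```python
import torch
import torch.nn as nn

class ToyCrosscoder(nn.Module):
    # One shared sparse latent space; a separate encoder and decoder per pretraining checkpoint.
    def __init__(self, d_act=512, d_latent=4096, n_checkpoints=8):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_act, d_latent) for _ in range(n_checkpoints))
        self.decoders = nn.ModuleList(nn.Linear(d_latent, d_act) for _ in range(n_checkpoints))

    def forward(self, acts):
        # acts: list of (batch, d_act) activation tensors, one per checkpoint, for the same inputs.
        z = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))  # shared sparse code
        recons = [dec(z) for dec in self.decoders]                          # reconstruct each checkpoint
        return z, recons

def crosscoder_loss(acts, z, recons, l1=1e-3):
    recon = sum(((r - a) ** 2).mean() for r, a in zip(recons, acts))
    return recon + l1 * z.abs().mean()   # sparsity: each latent is a candidate "feature"
```

Tracking which latents are active at which checkpoint is then what lets you see features emerge, persist, or shift over training.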
lchoshen.bsky.social
Employing mechanistic interpretability to study how models learn, not just where they end up
2 papers find:
There are phase transitions where features emerge and stay throughout learning
🤖📈🧠
alphaxiv.org/pdf/2509.17196
@amuuueller.bsky.social @abosselut.bsky.social
alphaxiv.org/abs/2509.05291
lchoshen.bsky.social
We also hope that attentive readers recognize our section titles are organized as a step-by-step plan!
lchoshen.bsky.social
Who says science can't be fun?
Our title “A Good Plan is Hard to Find” is a reference to Flannery O’Connor’s short story “A Good Man is Hard to Find” [...] we encourage you to find all six references
lchoshen.bsky.social
And yet again, we see that preference is biased towards things other than speed and performance. Yes, we are not machines, and we like short plans... we don't run at 5GHz either.
And importantly:
lchoshen.bsky.social
They found that it is really hard to predict what is helpful (I wonder if that is because helpfulness itself is quite noisy; how predictable is it in general, even with the best information?).
But also that plans, even bad ones, help LLMs' and humans' performance (though they slow them down).
lchoshen.bsky.social
Then all they needed to do was run and run and run. They compared plans on how well models succeed when using them, how much people preferred them, and the reward that 6 reward models gave them.
lchoshen.bsky.social
The authors tasked many people with solving complicated questions based on information from step-by-step plans, and checked which plans help more, taking into account whether they helped strong solvers (with IRT; see the sketch after this post).

arxiv.org/abs/2509.18632
@nbalepur.bsky.social
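For the curious, one simple IRT-flavoured way to separate "good plan" from "strong solver"; a Rasch-style illustration of the idea, not necessarily the paper's exact model:

```python
import torch

def fit_plan_helpfulness(solver_ids, plan_ids, correct, n_solvers, n_plans, steps=2000):
    # correct[i] = 1 if solver solver_ids[i] answered correctly when given plan plan_ids[i].
    ability = torch.zeros(n_solvers, requires_grad=True)     # per-solver skill
    helpfulness = torch.zeros(n_plans, requires_grad=True)   # per-plan "easiness"
    opt = torch.optim.Adam([ability, helpfulness], lr=0.05)
    y = torch.tensor(correct, dtype=torch.float32)
    s, p = torch.tensor(solver_ids), torch.tensor(plan_ids)
    for _ in range(steps):
        opt.zero_grad()
        logits = ability[s] + helpfulness[p]   # success odds = solver skill + plan quality
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
        loss.backward()
        opt.step()
    return helpfulness.detach()  # higher = the plan helped even after accounting for who was solving
```

Plans that only "win" because strong solvers happened to get them stop looking good under this kind of control.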
lchoshen.bsky.social
Helpfulness is what we are after, and we test it by asking humans for preferences, or by asking reward models.
And they fail 😆

They show that humans are bad at predicting what is helpful, and so are reward models (all close to chance).
Reward models don't even predict what helps LLMs
RL🤔
🤖📈🧠
#AI #LLM
lchoshen.bsky.social
And a random tip if you made it this far:
Take the reader on a journey of understanding,
not the journey you made to understand.