Here's the (more updated) NeurIPS version: proceedings.neurips.cc/paper_files/...
Also, more recently we extended the use of power laws to characterize how representations change over (pre/post) training in LLMs. 🙂
🧵 here: bsky.app/profile/arna...
I have been asked this when talking about our work on using power laws to study representation quality in deep neural networks, glad to have a more concrete answer now! 😃
www.biorxiv.org/content/10.1...
His “epigenetic landscape” is a diagrammatic representation of the constraints influencing embryonic development.
On his 50th birthday, his colleagues gave him a pinball machine on the model of the epigenetic landscape.
🧪 🦫🦋 🌱🐋 #HistSTM #philsci #evobio
Funded by @ivado.bsky.social and in collaboration with the IVADO regroupement 1 (AI and Neuroscience: ivado.ca/en/regroupem...).
Interested? See the details in the comments. (1/3)
🧠🤖
Previously, we showed that neural representations for the control of movement are largely distinct following supervised or reinforcement learning. The latter most closely matches NHP recordings.
We used a combination of neural recordings & modelling to show that RL yields neural dynamics closer to biology, with useful continual learning properties.
www.biorxiv.org/content/10.1...
I feel spectral metrics can go a long way in unlocking LLM understanding+design. 🚀
@natolambert.bsky.social + the OLMo team!
Paper 📝: arxiv.org/abs/2509.23024
👩💻 Code : Coming soon! 👨💻
@melodylizx.bsky.social @kumarkagrawal.bsky.social Komal Teru @glajoie.bsky.social @adamsantoro.bsky.social @tyrellturing.bsky.social
at @mila-quebec.bsky.social @berkeleyair.bsky.social @cohere.com & @googleresearch.bsky.social!
🧵9/9
- Pretraining: Compress → Expand (Memorize) → Compress (Generalize).
- Post-training: SFT/DPO → Expand; RLVR → Consolidate.
Representation geometry offers insights into when models memorize vs. generalize! 🤓
🧵8/9
On SciQ:
- Removing top 10/50 directions barely hurts accuracy.✅
- Retaining only top 10/50 directions CRUSHES accuracy.📉
As supported by our theoretical results, the eigenspectrum tail encodes critical task information! 🤯
🧵7/9
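For intuition, here's a minimal sketch of this kind of intervention (my own construction, not the paper's exact pipeline): project hidden-state features out of, or onto, their top-k principal directions before re-evaluating the same downstream probe. The feature matrix and k values below are placeholders.

```python
# Hypothetical sketch: ablate or retain the top-k principal directions of an
# (n_samples x d) feature matrix before scoring it with the same probe.
import numpy as np

def project_features(features: np.ndarray, k: int, keep_top: bool) -> np.ndarray:
    X = features - features.mean(axis=0, keepdims=True)   # center the features
    _, _, Vt = np.linalg.svd(X, full_matrices=False)       # rows of Vt = principal directions
    top = Vt[:k]                                            # (k, d) dominant directions
    onto_top = X @ top.T @ top                              # component inside the top-k subspace
    return onto_top if keep_top else X - onto_top

# Usage: evaluate the same probe on both feature sets.
X = np.random.randn(500, 768)                               # stand-in for LLM hidden states
X_without_top = project_features(X, k=50, keep_top=False)   # "remove top 50 directions"
X_only_top = project_features(X, k=50, keep_top=True)       # "retain only top 50 directions"
```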
We show, both through theory and with simulations in a toy model, that these non-monotonic spectral changes occur due to gradient descent dynamics with cross-entropy loss under 2 conditions:
1. skewed token frequencies
2. representation bottlenecks
🧵6/9
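To make the two ingredients concrete, here's a toy sketch I put together (not the paper's model or theory): a linear softmax model with a narrow bottleneck (d much smaller than the vocab), trained by gradient descent with cross-entropy on Zipf-distributed tokens, while tracking the effective rank of the embeddings. All hyperparameters and the next-token rule are made up for illustration.

```python
# Toy sketch combining the two ingredients above: skewed (Zipfian) token
# frequencies + a low-dimensional representation bottleneck. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
V, d = 200, 16                                   # vocab size, bottleneck width (d << V)
freqs = 1.0 / np.arange(1, V + 1)                # Zipf-like token frequencies
freqs /= freqs.sum()

E = rng.normal(scale=0.1, size=(V, d))           # token -> bottleneck embedding
U = rng.normal(scale=0.1, size=(d, V))           # bottleneck -> logits
lr, steps, batch = 0.5, 3000, 256

def effective_rank(H, eps=1e-12):
    eigs = np.linalg.svd(H - H.mean(0), compute_uv=False) ** 2
    p = eigs / (eigs.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

for t in range(steps):
    x = rng.choice(V, size=batch, p=freqs)       # inputs drawn with skewed frequencies
    y = (x + 1) % V                              # toy next-token targets
    H = E[x]                                     # (batch, d) bottlenecked hidden states
    logits = H @ U
    P = np.exp(logits - logits.max(1, keepdims=True))
    P /= P.sum(1, keepdims=True)
    G = P.copy()
    G[np.arange(batch), y] -= 1.0                # d(cross-entropy)/d(logits)
    dU = (H.T @ G) / batch
    dE = (G @ U.T) / batch                       # gradient w.r.t. the used embedding rows
    U -= lr * dU
    np.add.at(E, x, -lr * dE)                    # accumulates updates for repeated tokens
    if t % 500 == 0:
        sample = E[rng.choice(V, size=512, p=freqs)]
        print(t, round(effective_rank(sample), 2))
```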
- SFT & DPO exhibit entropy-seeking expansion, favoring instruction memorization but reducing OOD robustness.📈
- RLVR exhibits compression-seeking consolidation, learning reward-aligned behaviors at the cost of reduced exploration.📉
🧵5/9
- Entropy-seeking: Correlates with short-sequence memorization (♾️-gram alignment).
- Compression-seeking: Correlates with dramatic gains in long-context factual reasoning, e.g. TriviaQA.
Curious about ♾️-grams?
See: bsky.app/profile/liuj...
🧵4/9
Warmup: Rapid compression, collapsing the representation onto dominant directions.
Entropy-seeking: Manifold expansion, adding info in non-dominant directions.📈
Compression-seeking: Anisotropic consolidation, selectively packing more info in dominant directions.📉
🧵3/9
BUT
🎢The spectral metrics (RankMe, αReQ) change non-monotonically (with more pretraining)!
Takeaway: We discover geometric phases of LLM learning!
🧵2/9
- Spectral Decay Rate, αReQ: Fraction of variance in non-dominant directions.
- RankMe: Effective Rank; #dims truly active.
⬇️αReQ ⇒ ⬆️RankMe ⇒ More complex!
🧵1/9
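For anyone who wants to poke at this, a minimal sketch (not necessarily the paper's exact estimators) of both metrics computed from a matrix of hidden-state features: the effective rank is the exponential of the entropy of the normalized eigenspectrum, and alpha comes from a simple least-squares power-law fit on the log-log spectrum.

```python
# Minimal sketch of the two spectral metrics, computed from LLM hidden states.
import numpy as np

def spectral_metrics(features: np.ndarray, eps: float = 1e-12):
    """features: (n_samples, d) matrix of hidden-state activations."""
    X = features - features.mean(axis=0, keepdims=True)    # center
    eigs = np.linalg.svd(X, compute_uv=False) ** 2           # covariance eigenspectrum (up to 1/n)

    # RankMe-style effective rank: exp of the entropy of the normalized spectrum.
    p = eigs / (eigs.sum() + eps)
    rankme = float(np.exp(-(p * np.log(p + eps)).sum()))

    # Decay exponent alpha: negative slope of log(eigenvalue) vs. log(rank).
    ranks = np.arange(1, len(eigs) + 1)
    alpha = float(-np.polyfit(np.log(ranks), np.log(eigs + eps), 1)[0])
    return rankme, alpha

# Usage: flatter spectra -> smaller alpha and larger effective rank.
rankme, alpha = spectral_metrics(np.random.randn(2000, 512))
print(f"RankMe ~ {rankme:.1f}, alpha ~ {alpha:.2f}")
```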
How does the complexity of this mapping change across LLM training? How does it relate to the model’s capabilities? 🤔
Announcing our #NeurIPS2025 📄 that dives into this.
🧵below
#AIResearch #MachineLearning #LLM
🚨 New preprint! 🚨
Excited and proud (& a little nervous 😅) to share our latest work on the importance of #theta-timescale spiking during #locomotion in #learning. If you care about how organisms learn, buckle up. 🧵👇
📄 www.biorxiv.org/content/10.1...
💻 code + data 🔗 below 🤩
#neuroskyence
This group got it working!
arxiv.org/abs/2506.17768
May be a great way to reduce AI energy use!!!
#MLSky 🧪
Can't wait to read in detail.
It's a pleasure to share our paper at @cp-cell.bsky.social, showing how mice learning over long timescales display key hallmarks of gradient descent (GD).
The culmination of my PhD supervised by @laklab.bsky.social, @saxelab.bsky.social and Rafal Bogacz!
Multi-agent reinforcement learning (MARL) often assumes that agents know when other agents cooperate with them. But for humans, this isn't always the case. For example, Plains Indigenous groups used to leave resources for others to use at effigies called Manitokan.
1/8