Michael Noukhovitch @CoLM 2025🥯
@mnoukhov.bsky.social
PhD in AI @mila-quebec.bsky.social RLHF and language grounding, whatever that means. Whitespace aficionado. mnoukhov.github.io
Pinned
mnoukhov.bsky.social
Our work on Asynchronous RLHF was accepted to #ICLR2025 ! (I was so excited to announce it, I forgot to say I was excited)

Used by @ai2.bsky.social for OLMo-2 32B 🔥
New results show ~70% speedups for LLM + RL math and reasoning 🧠

🧵below or hear my DLCT talk online on March 28!
Reposted by Michael Noukhovitch @CoLM 2025🥯
dvnxmvlhdf5.bsky.social
Preprint Alert 🚀

Multi-agent reinforcement learning (MARL) often assumes that agents know when other agents cooperate with them. But for humans, this isn’t always the case. For example, Plains Indigenous groups used to leave resources for others to use at effigies called Manitokan.
1/8
Manitokan are images set up where one can bring a gift or receive a gift. 1930s Rocky Boy Reservation, Montana, Montana State University photograph. Colourized with AI
mnoukhov.bsky.social
@dnllvy.bsky.social @oumarkaba.bsky.social presenting cool work at #ICLR2025 on generative models for crystals leveraging symmetry ❄️🪞, repping @mila-quebec.bsky.social
Reposted by Michael Noukhovitch @CoLM 2025🥯
saravera.bsky.social
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks, investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
A circular diagram with a blue whale icon at the center. The diagram shows 8 interconnected research areas around LLM reasoning represented as colored rectangular boxes arranged in a circular pattern. The areas include: §3 Analysis of Reasoning Chains (central cloud), §4 Scaling of Thoughts (discussing thought length and performance metrics), §5 Long Context Evaluation (focusing on information recall), §6 Faithfulness to Context (examining question answering accuracy), §7 Safety Evaluation (assessing harmful content generation and jailbreak resistance), §8 Language & Culture (exploring moral reasoning and language effects), §9 Relation to Human Processing (comparing cognitive processes), §10 Visual Reasoning (covering ASCII generation capabilities), and §11 Following Token Budget (investigating direct prompting techniques). Arrows connect the sections in a clockwise flow, suggesting an iterative research methodology.
mnoukhov.bsky.social
Hope the Llama team releases more details. Until then check out my paper on async RLHF and feel free to message me to chat about it at ICLR!

bsky.app/profile/mnou...
mnoukhov.bsky.social
And to reviewer 2, I guess it does work in large-scale distributed training! I am really curious how they did the resource balancing to account for different computational speeds.
mnoukhov.bsky.social
Llama 4 uses async RLHF and I would just like to announce that I called it t.co/w9qJxr944C
mnoukhov.bsky.social
Classic Benno, hanging out with his human friends John, Ṃ̵̢͍̬̘ͧ̉͆ͤ̈͆̂ä́t̢̢̡̫̻̰͈̣͚͆͛͗̈ͭ̉̕͟ͅt̛̹̰̑̓ͭ͗h̸̷̛̛̥̱͉͎̯̻̼͕͉̻̄̅̾ͣ̉̈͌̀ͮ͋ͯ͐ͮͥ̿͛ͪ͜͠͝ẹ̱̞̬̅͂ͯ̈́̆̎ͣw̵̨̧̧̥̩͔͎̬̭͚̩͉ͤ̌͢͝, and Cͧͯ_̸̨̱͙̦͍̉̒͐͐͂͋̎̂ͬ̑͜͝h͐_̮͒͢r̸̛̳̘̠̯ͣͧͦ̏͑ͯ͡i̷̡̡͔̪̟͙͖̫̩̭̳̤͕̞͙̯͚̫̯ͭͤ̌̽͋ͯ̉ͥ́ͭͧͥͦͬ̀ͨ͌̒͢͞s̺̹͛ͭ̐͗ͤͫ́̃ͤ͢͠
mnoukhov.bsky.social
Thanks again to my collaborators:
@vwxyzjn.bsky.social
@sophie-xhonneux.bsky.social
@arianh.bsky.social
Rishabh and Aaron who have not yet migrated 🦋

DMs open📲let's chat about everything LLM + RL @ ICLR and check out
Paper 📰 arxiv.org/abs/2410.18252
Code 🧑‍💻 github.com/mnoukhov/asy...
mnoukhov.bsky.social
We also have an appendix full of fun details like "How to make RLOO work off-policy" and "Why synchronous RLHF is not feasible in the long term" from an engineering perspective 👷🛠️
Would love critiques from any engineers working on RLHF if they feel I missed something!
mnoukhov.bsky.social
We showed great results on RLHF but reviewers wanted reasoning + math 🧠🤔 Thanks to my labmates Amirhossein and Milad, we got Rho-1B training on GSM8k!
Online DPO slightly outperforms PPO on GSM8k but more importantly 1-step Async runs 68% faster than Sync and matches performance🔥
mnoukhov.bsky.social
Recap⌛️RL training of LLMs is frequently online and *on-policy*, but training and generation alternate, each sitting idle while waiting for the other to finish.
We run training and generation at the same time, but now we're training on samples from a previous timestep aka *off-policy* RL!
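
A toy sketch of the scheduling idea in the recap above, not the paper's implementation: it only models wall-clock time and how stale each batch is. The function names and the unit costs (`gen_time`, `train_time`) are made up for illustration, and the numbers it produces are not results from the paper.

```python
def sync_total_time(num_steps, gen_time, train_time):
    """Synchronous: generate a batch, then train on it, one after the other."""
    return num_steps * (gen_time + train_time)


def async_total_time(num_steps, gen_time, train_time):
    """One-step async: generate batch t+1 while training on batch t."""
    # The first batch has to exist before any training can start.
    total = gen_time
    for step in range(num_steps):
        is_last = step == num_steps - 1
        # Training on batch `step` overlaps with generating batch `step + 1`,
        # so each round costs the slower of the two. The last round only trains.
        total += train_time if is_last else max(gen_time, train_time)
    return total


def policy_lag(num_steps, asynchronous):
    """How many updates behind the trainer the generating policy was, per batch."""
    lags = []
    for step in range(num_steps):
        trainer_version = step  # policy version being updated on this batch
        # In the async schedule, batch `step` was sampled before the previous
        # update finished, so it comes from one version back (off-policy).
        generator_version = max(step - 1, 0) if asynchronous else step
        lags.append(trainer_version - generator_version)
    return lags


if __name__ == "__main__":
    steps, gen, train = 100, 1.0, 1.0  # hypothetical, equal-cost phases
    print("sync time :", sync_total_time(steps, gen, train))
    print("async time:", async_total_time(steps, gen, train))
    print("sync lag  :", policy_lag(4, asynchronous=False))
    print("async lag :", policy_lag(4, asynchronous=True))
```

In this toy model the overlap helps most when generation and training cost about the same; if one phase dominates, the other still idles for part of each round.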
mnoukhov.bsky.social
Reminds me of a very similar shift towards open science by machine learning in 1999 (jmlr.org/statement.html). Nowadays we've got really great infrastructure in the form of @openreview.bsky.social! Reach out if you're considering shifting to open science and check out jmlr.org/tmlr/ for inspo :)
mnoukhov.bsky.social
Programming using an AI assistant in order to improve AI assistants is giving me strong sci-fi vibes. Specifically Isaac Asimov, who clearly invented vibe coding in 1956 users.ece.cmu.edu/~gamvrosi/th...
mnoukhov.bsky.social
I'm at #NeurIPS2024 this week if anyone wants to talk about RLHF while drinking an overpriced (but excellent) pourover coffee or tea!
mnoukhov.bsky.social
It's actually necessary because bluesky is (now officially) federated and you're on a single instance called a PDS, and in this case bsky.social. Others exist (?) or will exist soon

A technical overview steveklabnik.com/writing/how-...
And a non-technical overview
www.theverge.com/24063290/fed...