Gokul Swamy
@gokul.dev
4.2K followers 420 following 110 posts
final year PhD student at @cmurobotics.bsky.social working on efficient algorithms for interactive learning (e.g. imitation / RL / RLHF). no model is an island. prefers email. https://gokul.dev/.
Pinned
gokul.dev
I was lucky enough to be invited to give a talk on our new paper on the value of RL in fine-tuning at Cornell last week! Because of my poor time management skills, the talk isn't as polished as I'd like, but I think the "vibes" are accurate enough to share: youtu.be/E4b3cSirpsg.
gokul.dev
Late, but arxiv.org/abs/0804.2996 is *incredible*, so many good lines (e.g., "This comes close to being an accusation of a false claim of priority for a false discovery of an untrue fact, which would be a rare triple-negative in the history of intellectual property disputes.").
The Epic Story of Maximum Likelihood
At a superficial level, the idea of maximum likelihood must be prehistoric: early hunters and gatherers may not have used the words "method of maximum likelihood" to describe their choice of where a...
arxiv.org
gokul.dev
I've been really enjoying the new Ninajirachi album -- it's very Boiler Room-core :)
gokul.dev
Thanks for the shout-out and I hope the lectures were at least somewhat understandable! Yeah, once things settle down a bit for me, I'd like to more deeply understand the connection between Rust's structural estimation and IRL as I conceive of it.
gokul.dev
We therefore advocate for caution when making or evaluating claims about LLM reasoning (and beyond) that rest on GRPO and PPO, and suggest using algorithms like RLOO or REBEL instead. Check out our blog post for links to our code and W&B logs if you'd like to reproduce our experiments.
gokul.dev
While this worked out for the better on some seeds, it doesn't have to in general. After all, an algorithm that behaves unexpectedly *well* in one setting can perform unexpectedly *poorly* in another, perhaps more important, setting.
gokul.dev
We see similar results on a didactic bandit problem -- i.e. a problem that has nothing to do with LLMs or reasoning! This implies that PPO / GRPO are fundamentally *not* following the true policy gradient.
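To give a flavor of what such a diagnostic can look like, here is a toy reconstruction (my own sketch with made-up hyperparameters; it is *not* the paper's experiment or code): a softmax policy over K arms is trained on rewards drawn independently of the chosen arm, once with a single-epoch vanilla policy gradient and once with a multi-epoch PPO/GRPO-style clipped surrogate. Since the true gradient is zero by construction, any systematic drift away from the uniform (already optimal) policy across seeds points at the heuristics rather than the reward.

```python
import numpy as np

K, ITERS, BATCH, EPOCHS, LR, EPS = 10, 500, 64, 4, 2.0, 0.2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(clipped: bool, seed: int) -> float:
    """Train a softmax policy on a K-armed bandit whose rewards ignore the arm.

    Returns the final policy's total-variation distance from uniform; a faithful
    policy-gradient method has zero *expected* update here, so consistent drift
    away from uniform would come from the algorithm's heuristics.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(K)
    for _ in range(ITERS):
        pi_old = softmax(theta)
        arms = rng.choice(K, size=BATCH, p=pi_old)
        rewards = rng.random(BATCH)        # random rewards, independent of the arm
        adv = rewards - rewards.mean()     # group-mean baseline, GRPO-style
        for _ in range(EPOCHS if clipped else 1):
            pi = softmax(theta)
            ratio = pi[arms] / pi_old[arms]
            if clipped:
                # PPO/GRPO-style clipped surrogate: the per-sample gradient is zeroed
                # once the ratio leaves [1 - EPS, 1 + EPS] in the direction the
                # advantage is pushing it.
                dead = ((adv > 0) & (ratio > 1 + EPS)) | ((adv < 0) & (ratio < 1 - EPS))
                coef = np.where(dead, 0.0, adv * ratio)
            else:
                coef = adv                 # single-epoch vanilla PG (ratio == 1)
            # d log pi(a) / d theta = one_hot(a) - pi, accumulated over the batch
            grad = -coef.sum() * pi
            np.add.at(grad, arms, coef)
            theta = theta + LR * grad / BATCH
    return 0.5 * np.abs(softmax(theta) - 1.0 / K).sum()

for name, clipped in [("vanilla PG", False), ("clipped surrogate", True)]:
    drift = np.mean([run(clipped, seed) for seed in range(10)])
    print(f"{name:18s} mean TV distance from uniform: {drift:.3f}")
```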
gokul.dev
We find that RLOO (an unbiased estimator of the vanilla PG) and REBEL (a regression-based approximation of online mirror descent) preserve performance as expected. In contrast, algorithms like PPO / GRPO that include heuristics (e.g. clipping) show a marked and unexpected change in performance.
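If RLOO is unfamiliar: it is essentially REINFORCE with a leave-one-out baseline over the k completions sampled per prompt. A minimal sketch of that baseline (my own illustration with made-up rewards, not the authors' code):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for the k sampled completions of one prompt.

    Each completion's baseline is the mean reward of the *other* k - 1 samples,
    so the baseline is independent of that completion's own reward.
    """
    k = rewards.shape[0]
    loo_baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_baseline

# e.g., k = 4 completions with hypothetical 0/1 rewards
print(rloo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ≈ [ 0.67 -0.67 -0.67  0.67]
```

Because each completion's baseline depends only on the other completions, subtracting it leaves the policy-gradient estimate unbiased; per the thread above, it is the clipping and normalization heuristics in PPO / GRPO that break this property.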
gokul.dev
With a truly random reward function, all policies look equally good: the *true* policy gradient is zero, and the initial policy is already optimal by construction. So, we'd expect performance to flatline. We use random rewards as a *diagnostic task* to compare different RL algs.
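For the curious, the "true gradient is zero" claim is just the score-function identity at work. A quick derivation in my own notation, assuming the reward r is drawn independently of the prompt x and response y (as in the random-reward setup):

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r \, \nabla_\theta \log \pi_\theta(y \mid x) \,\big] \\
  &= \mathbb{E}_{x}\Big[\, \mathbb{E}[r] \sum_{y} \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big] \\
  &= \mathbb{E}_{x}\Big[\, \mathbb{E}[r]\, \nabla_\theta \sum_{y} \pi_\theta(y \mid x) \Big]
   = \mathbb{E}_{x}\big[\, \mathbb{E}[r]\, \nabla_\theta 1 \,\big] = 0.
\end{align*}
```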
gokul.dev
Led by Owen Oertell & Wenhao Zhan, joint w/ Steven Wu, Kiante Brantley, Jason Lee, and Wen Sun. If a project has got Wen, Owen, Wenhao, and Qwen on it, you know it's gotta be good 😛.
gokul.dev
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find that this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-c...
Heuristics Considered Harmful: RL With Random Rewards Should Not Make LLMs Reason | Notion
Owen Oertell*, Wenhao Zhan*, Gokul Swamy, Zhiwei Steven Wu, Kiante Brantley, Jason Lee, Wen Sun
tinyurl.com
Reposted by Gokul Swamy
sacha2.bsky.social
very nice lectures, watch them from time to time
gokul.dev
It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public!
Reposted by Gokul Swamy
sharky6000.bsky.social
Want to learn about online learning, game solving, RL, imitation learning with applications to robotics, and RLHF with applications to language modeling? Check out this course! 👍
gokul.dev
It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public!
gokul.dev
While I can't promise everything will be crystal-clear after going through the lectures (especially because of my handwriting :p), I hope that if nothing else, you can tell how beautiful we all find these ideas. If that feeling comes across, I'll feel like I have succeeded! :)
gokul.dev
The second was being able to teach this course with my amazing advisors, Drew Bagnell and Steven Wu -- the folks I learned all of this stuff from. Fun fact: because of parking fees, Drew actually *paid* to lecture. And I'm always grateful to ZSW for pushing me out of the nest.
gokul.dev
Two other things made this course particularly special. The first was the students and their *incredible* questions -- there were so many times where I was like wow, it took me *YEARS* before I realized that was the right question to be asking.
gokul.dev
We also had wonderful guest lectures from Yuda Song on hybrid RL (youtu.be/1B2XGXQ2hfA), Sanjiban Choudhury on scaling imitation (youtu.be/KnXSeTuCgFI), and Wen Sun on RLHF algorithms (youtu.be/qdkBZJywi_4).
Algorithmic Foundations of Interactive Learning SP25: Lecture 17
YouTube video by Gokul Swamy
youtu.be
gokul.dev
My favorite lectures to give were on the value of interaction in imitation / RLHF! youtu.be/uESAXg-CXFs, youtu.be/N8-Nh_iTmps, youtu.be/qHvB30J5gyo, youtu.be/ZzFjoH47GIg. It took 5 years, but I finally have an answer that at least I find compelling :p.
Algorithmic Foundations of Interactive Learning SP25: Lecture 19
YouTube video by Gokul Swamy
youtu.be
gokul.dev
To do so, we worked backwards from things like ChatGPT and RMA and "backed out" a "dependency graph". We then did a "forward pass" over the semester, going from online learning, to game solving, to core RL, to imitation learning / robot learning, to RLHF / LLM fine-tuning.
gokul.dev
I think in a field as fast-paced as machine learning, a good course gives students a conceptual framework for quickly understanding new developments and for seeing what is actually "new" relative to the classical algorithms. We also wanted to explain *when* scale isn't "all you need."
gokul.dev
You can access all the content here:
Course Website: interactive-learning-algos.github.io
Lecture Playlist: youtube.com/playlist?lis...
Scribe Notes "Book": interactive-learning-algos.github.io/assets/pdfs/....
Homeworks / class competition material are also public!
Home
Website for AFIL course.
interactive-learning-algos.github.io
gokul.dev
It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public!
gokul.dev
Shortcut models enable scaling offline RL, both at train-time and at test-time! We beat so many other algorithms on so many tasks we had to stick most of the results in the appendix 😅. Very proud of @nico-espinosa-dice.bsky.social for spearheading this project, check out his thread!
nico-espinosa-dice.bsky.social
by incorporating self-consistency during offline RL training, we unlock three orthogonal directions of scaling:

1. efficient training (i.e. limit backprop through time)
2. expressive model classes (e.g. flow matching)
3. inference-time scaling (sequential and parallel)
gokul.dev
Boston friends: I'll be in the Cambridge area for the next few days, shoot me a message if you'd like to catch up :).
gokul.dev
I won't be at #ICLR2025 myself this time around but please go talk to lead authors Nico, Zhaolin, and Runzhe about their bleeding-edge algorithms for imitation learning and RLHF!