Bo Liu (Benjamin Liu)
@benjamin-eecs.bsky.social
110 followers 21 following 8 posts
Reinforcement Learning PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Ex intern @MSFTResearch @deepseek_ai | DeepSeek-V2, DeepSeek-VL, DeepSeek-Prover
Pinned
benjamin-eecs.bsky.social
We're excited about self-play unlocking continuously improving agents. RL selects CoT patterns from LLMs. Games = perfect testing grounds.
SPIRAL: models learn via self-competition. Kuhn Poker → +8.7% on math, +18.1% on Minerva Math! 🃏
Paper: huggingface.co/papers/2506....
Code: github.com/spiral-rl/spiral
benjamin-eecs.bsky.social
Co-first authors: @LeonGuertler @simon_ycl @zzlccc, advisor @natashajaques.bsky.social
Team: @QPHutu @danibalcells @mickel_liu C. Tan @shi_weiyan @mavenlin W. S. Lee
@NUSingapore @ASTARsg @Northeastern @UW 🚀
benjamin-eecs.bsky.social
New paradigm: instead of curating problems, create environments where models discover reasoning through competition.
Self-play = autonomous improvement without human supervision. Simple games improve general reasoning!
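Not SPIRAL's actual API, but a minimal sketch of the idea: a two-player zero-sum text game where the same policy plays both seats, so the training signal comes from competition instead of curated problems. `MatchingPennies`, `policy`, and `self_play_episode` are illustrative stand-ins (the real environments live at github.com/spiral-rl/spiral):

```python
import random

# Minimal sketch, NOT SPIRAL's actual API: a two-player zero-sum text game
# where the SAME policy plays both seats (self-play). The "policy" here is
# a random stand-in for an LLM; rewards always sum to zero.

class MatchingPennies:
    """Trivial text game: each player secretly picks 'H' or 'T'.
    Player 0 wins on a match, player 1 wins on a mismatch."""
    def reset(self):
        self.picks = {}
        return "Pick H or T."

    def step(self, player, action):
        self.picks[player] = action
        done = len(self.picks) == 2
        rewards = {0: 0.0, 1: 0.0}
        if done:
            player0_wins = self.picks[0] == self.picks[1]
            rewards = {0: 1.0, 1: -1.0} if player0_wins else {0: -1.0, 1: 1.0}
        return "Pick H or T.", rewards, done

def policy(observation, role):
    # Stand-in for an LLM conditioned on the game state and its role.
    return random.choice(["H", "T"])

def self_play_episode(game):
    obs, trajectory, player, done = game.reset(), [], 0, False
    while not done:
        action = policy(obs, role=player)       # same model plays both roles
        trajectory.append((player, obs, action))
        obs, rewards, done = game.step(player, action)
        player = 1 - player                     # alternate seats
    return trajectory, rewards                  # zero-sum training signal

print(self_play_episode(MatchingPennies()))
```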
benjamin-eecs.bsky.social
We developed Role-conditioned Advantage Estimation (RAE) to stabilize self-play training.
Without RAE we see "thinking collapse": response length crashes from 3,500 to 0 characters and math performance drops 66%.
RAE keeps reasoning alive!
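Based on the name and the thread, a minimal sketch of what role-conditioned advantage estimation could look like: a separate running baseline per (game, role), so asymmetric roles (first vs. second player) don't bias the advantage. The EMA update and constants are my assumptions, not the paper's exact recipe:

```python
from collections import defaultdict

# Sketch of Role-conditioned Advantage Estimation (RAE) as described in the
# thread: keep a separate running baseline per (game, role) so asymmetric
# roles don't bias advantages. EMA decay and update order are illustrative.

class RAE:
    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha                    # EMA decay for the baseline
        self.baseline = defaultdict(float)    # one baseline per (game, role)

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        key = (game, role)
        # Update the role-conditioned baseline with an exponential moving average.
        b = self.alpha * self.baseline[key] + (1 - self.alpha) * episode_return
        self.baseline[key] = b
        # Advantage = return minus the baseline for THIS role in THIS game.
        return episode_return - b

rae = RAE()
# In zero-sum self-play, the two roles see mirrored returns:
print(rae.advantage("kuhn_poker", 0, +1.0))
print(rae.advantage("kuhn_poker", 1, -1.0))
```

The point of the per-role split: in zero-sum self-play the two roles see mirrored returns, so a single shared baseline would systematically mis-estimate both.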
benjamin-eecs.bsky.social
Multi-game magic:
Single game: ~41% reasoning average
Multi-game: 42.7% - skills synergize!
Even strong models improve:
DeepSeek-R1-Distill-Qwen-7B jumps 59.7%→61.7%. AIME'25 +10 points! 📈
benjamin-eecs.bsky.social
Different games → different skills:
TicTacToe → spatial (56% on Snake)
Kuhn Poker → probabilistic (91.7% on Pig Dice!)
Simple Negotiation → strategic (55.8% on Truth & Deception)
Each game develops distinct abilities!
benjamin-eecs.bsky.social
Why self-play? We compared training setups:
Self-play: 39.7% math, 47.8% general reasoning
Fixed opponent: much worse
Random opponent: complete collapse
Key: as you improve, so does your opponent; a fixed opponent eventually becomes too easy.
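A sketch of the three opponent regimes compared above (names illustrative): only the self-play branch ties the opponent's strength to the learner's current strength.

```python
# Names illustrative, not SPIRAL's code. Only self-play keeps the
# opponent's difficulty pinned to the learner's skill.

def get_opponent(mode, current_policy, frozen_snapshot, random_policy):
    if mode == "self_play":
        return current_policy    # opponent improves whenever you do
    if mode == "fixed":
        return frozen_snapshot   # static difficulty: eventually too easy
    return random_policy         # no adaptive pressure at all
```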
benjamin-eecs.bsky.social
To understand the poker→math transfer, we identified 3 reasoning patterns:
📊 Expected Value Calculation
🔍 Case-by-Case Analysis
🎯 Pattern Recognition
These patterns from games transfer to math benchmarks. Games teach generalizable thinking!
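A toy illustration of the first pattern (my example, not from the paper): the same weighted-average arithmetic that prices a poker call also answers benchmark-style probability questions.

```python
from fractions import Fraction

# Toy illustration (my example, not from the paper) of the shared
# "expected value" pattern: identical weighted-average arithmetic shows up
# in poker decisions and in math-benchmark probability problems.

# Poker-flavored: call a 1-chip bet with a 1/3 chance of winning a 3-chip pot?
p_win = Fraction(1, 3)
ev_call = p_win * 3 + (1 - p_win) * (-1)       # 1/3*3 + 2/3*(-1) = 1/3
print("EV of calling:", ev_call)               # positive, so calling is +EV

# Math-flavored: expected value of a fair six-sided die roll.
ev_die = sum(Fraction(k, 6) for k in range(1, 7))
print("EV of a die roll:", ev_die)             # 7/2
```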
Reposted by Bo Liu (Benjamin Liu)
turingpost.bsky.social
Natural Language Reinforcement Learning (NLRL) redefines Reinforcement Learning (RL).

NLRL's main idea:
The core components of RL, such as goals, strategies (policies), and evaluation methods (value functions), are reimagined in natural language instead of rigid math.

Let's explore this approach more precisely🧵
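A minimal sketch of the flavor (my reading, not NLRL's code): the scalar value function becomes a textual evaluation, and policy improvement becomes reasoning over that text. `llm` is a hypothetical stand-in for any text-completion model:

```python
# Illustrative sketch of the NLRL idea (my reading, not the paper's code):
# replace the scalar value function V(s) with a natural-language evaluation,
# and pick actions by reasoning over that text. `llm` is a hypothetical
# stand-in for any text-completion model, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for any text-completion model")

def language_value(state_description: str) -> str:
    """A 'language value function': a textual assessment of a state
    (who is ahead, why, what could go wrong) instead of a single number."""
    prompt = (
        "You are evaluating a game state.\n"
        f"State: {state_description}\n"
        "Assess the position: who is ahead, why, and what could go wrong?"
    )
    return llm(prompt)

def language_policy_improvement(state_description: str, evaluation: str) -> str:
    """Choose the next action by reasoning over the textual evaluation,
    mirroring policy improvement but carried out in natural language."""
    prompt = (
        f"State: {state_description}\n"
        f"Evaluation: {evaluation}\n"
        "Given this assessment, what is the best next action and why?"
    )
    return llm(prompt)
```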