Bo Liu (Benjamin Liu)
@benjamin-eecs.bsky.social
110 followers 21 following 8 posts
Reinforcement Learning PhD @NUSingapore | Undergrad @PKU1898 | Building autonomous decision making systems | Ex intern @MSFTResearch @deepseek_ai | DeepSeek-V2, DeepSeek-VL, DeepSeek-Prover
Pinned
benjamin-eecs.bsky.social
We're excited about self-play unlocking continuously improving agents. RL selects CoT patterns from LLMs. Games = perfect testing grounds.
SPIRAL: models learn via self-competition. Kuhn Poker → +8.7% on math, +18.1% on Minerva Math! 🃏
Paper: huggingface.co/papers/2506....
Code: github.com/spiral-rl/spiral
benjamin-eecs.bsky.social
Co-first authors: @LeonGuertler @simon_ycl @zzlccc, advisor @natashajaques.bsky.social
Team: @QPHutu @danibalcells @mickel_liu C. Tan @shi_weiyan @mavenlin W. S. Lee
@NUSingapore @ASTARsg @Northeastern @UW 🚀
benjamin-eecs.bsky.social
New paradigm: instead of curating problems, create environments where models discover reasoning through competition.
Self-play = autonomous improvement without human supervision. Simple games improve general reasoning!
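Not SPIRAL's actual API, but a minimal sketch of the idea: a two-player zero-sum text game where the same policy plays both seats, so the training signal comes from competition instead of curated problems. `MatchingPennies`, `policy`, and `self_play_episode` are illustrative stand-ins (the real environments live at github.com/spiral-rl/spiral):

```python
import random

# Minimal sketch, NOT SPIRAL's actual API: a two-player zero-sum text game
# where the SAME policy plays both seats (self-play). The "policy" here is
# a random stand-in for an LLM; rewards always sum to zero.

class MatchingPennies:
    """Trivial text game: each player secretly picks 'H' or 'T'.
    Player 0 wins on a match, player 1 wins on a mismatch."""
    def reset(self):
        self.picks = {}
        return "Pick H or T."

    def step(self, player, action):
        self.picks[player] = action
        done = len(self.picks) == 2
        rewards = {0: 0.0, 1: 0.0}
        if done:
            player0_wins = self.picks[0] == self.picks[1]
            rewards = {0: 1.0, 1: -1.0} if player0_wins else {0: -1.0, 1: 1.0}
        return "Pick H or T.", rewards, done

def policy(observation, role):
    # Stand-in for an LLM conditioned on the game state and its role.
    return random.choice(["H", "T"])

def self_play_episode(game):
    obs, trajectory, player, done = game.reset(), [], 0, False
    while not done:
        action = policy(obs, role=player)       # same model plays both roles
        trajectory.append((player, obs, action))
        obs, rewards, done = game.step(player, action)
        player = 1 - player                     # alternate seats
    return trajectory, rewards                  # zero-sum training signal

print(self_play_episode(MatchingPennies()))
```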
benjamin-eecs.bsky.social
We developed Role-conditioned Advantage Estimation (RAE) to stabilize self-play training.
Without RAE we see "thinking collapse": response length crashes from 3,500 to 0 characters and math performance drops 66%.
RAE keeps reasoning alive!
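Based on the name and the thread, a minimal sketch of what role-conditioned advantage estimation could look like: a separate running baseline per (game, role), so asymmetric roles (first vs. second player) don't bias the advantage. The EMA update and constants are my assumptions, not the paper's exact recipe:

```python
from collections import defaultdict

# Sketch of Role-conditioned Advantage Estimation (RAE) as described in the
# thread: keep a separate running baseline per (game, role) so asymmetric
# roles don't bias advantages. EMA decay and update order are illustrative.

class RAE:
    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha                    # EMA decay for the baseline
        self.baseline = defaultdict(float)    # one baseline per (game, role)

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        key = (game, role)
        # Update the role-conditioned baseline with an exponential moving average.
        b = self.alpha * self.baseline[key] + (1 - self.alpha) * episode_return
        self.baseline[key] = b
        # Advantage = return minus the baseline for THIS role in THIS game.
        return episode_return - b

rae = RAE()
# In zero-sum self-play, the two roles see mirrored returns:
print(rae.advantage("kuhn_poker", 0, +1.0))
print(rae.advantage("kuhn_poker", 1, -1.0))
```

The point of the per-role split: in zero-sum self-play the two roles see mirrored returns, so a single shared baseline would systematically mis-estimate both.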
benjamin-eecs.bsky.social
Multi-game magic:
Single game: ~41% reasoning average
Multi-game: 42.7% - skills synergize!
Even strong models improve:
DeepSeek-R1-Distill-Qwen-7B jumps 59.7%→61.7%. AIME'25 +10 points! 📈
benjamin-eecs.bsky.social
Different games → different skills:
TicTacToe → spatial (56% on Snake)
Kuhn Poker → probabilistic (91.7% on Pig Dice!)
Simple Negotiation → strategic (55.8% on Truth & Deception)
Each game develops distinct abilities!
benjamin-eecs.bsky.social
Why self-play? We compared training setups:
Self-play: 39.7% math, 47.8% general reasoning
Fixed opponent: much worse
Random opponent: complete collapse
Key: as you improve, so does your opponent; a fixed opponent eventually becomes too easy.
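A sketch of the three opponent regimes compared above (names illustrative): only the self-play branch ties the opponent's strength to the learner's current strength.

```python
# Names illustrative, not SPIRAL's code. Only self-play keeps the
# opponent's difficulty pinned to the learner's skill.

def get_opponent(mode, current_policy, frozen_snapshot, random_policy):
    if mode == "self_play":
        return current_policy    # opponent improves whenever you do
    if mode == "fixed":
        return frozen_snapshot   # static difficulty: eventually too easy
    return random_policy         # no adaptive pressure at all
```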
benjamin-eecs.bsky.social
To understand the poker→math transfer, we identified 3 reasoning patterns:
📊 Expected Value Calculation
🔍 Case-by-Case Analysis
🎯 Pattern Recognition
These patterns from games transfer to math benchmarks. Games teach generalizable thinking!
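A toy illustration of the first pattern (my example, not from the paper): the same weighted-average arithmetic that prices a poker call also answers benchmark-style probability questions.

```python
from fractions import Fraction

# Toy illustration (my example, not from the paper) of the shared
# "expected value" pattern: identical weighted-average arithmetic shows up
# in poker decisions and in math-benchmark probability problems.

# Poker-flavored: call a 1-chip bet with a 1/3 chance of winning a 3-chip pot?
p_win = Fraction(1, 3)
ev_call = p_win * 3 + (1 - p_win) * (-1)       # 1/3*3 + 2/3*(-1) = 1/3
print("EV of calling:", ev_call)               # positive, so calling is +EV

# Math-flavored: expected value of a fair six-sided die roll.
ev_die = sum(Fraction(k, 6) for k in range(1, 7))
print("EV of a die roll:", ev_die)             # 7/2
```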
Reposted by Bo Liu (Benjamin Liu)
turingpost.bsky.social
Natural Language Reinforcement Learning (NLRL) redefines Reinforcement Learning (RL).

NLRL's main idea:
The core components of RL, such as goals, strategies (policies), and evaluation methods (value functions), are reimagined in natural language instead of rigid math.

Let's explore this approach more precisely🧵
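A minimal sketch of the flavor (my reading, not NLRL's code): the scalar value function becomes a textual evaluation, and policy improvement becomes reasoning over that text. `llm` is a hypothetical stand-in for any text-completion model:

```python
# Illustrative sketch of the NLRL idea (my reading, not the paper's code):
# replace the scalar value function V(s) with a natural-language evaluation,
# and pick actions by reasoning over that text. `llm` is a hypothetical
# stand-in for any text-completion model, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for any text-completion model")

def language_value(state_description: str) -> str:
    """A 'language value function': a textual assessment of a state
    (who is ahead, why, what could go wrong) instead of a single number."""
    prompt = (
        "You are evaluating a game state.\n"
        f"State: {state_description}\n"
        "Assess the position: who is ahead, why, and what could go wrong?"
    )
    return llm(prompt)

def language_policy_improvement(state_description: str, evaluation: str) -> str:
    """Choose the next action by reasoning over the textual evaluation,
    mirroring policy improvement but carried out in natural language."""
    prompt = (
        f"State: {state_description}\n"
        f"Evaluation: {evaluation}\n"
        "Given this assessment, what is the best next action and why?"
    )
    return llm(prompt)
```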