jonhue.github.io
Also check out our work applying the same algorithm to offline data: self-distillation.github.io/SDFT
Here, the baseline is SFT, not GRPO.
We show: Self-distillation enables continual learning.
(n/n)
SDPO enables learning even before seeing any reward, simply by sequentially fixing "errors" as the model encounters them!
(7/n)
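To make that concrete, here is a minimal, runnable sketch of such a reward-free loop. Everything in it (generate, run_in_env, distill_toward) is a hypothetical stub of ours, not the SDPO implementation:

```python
# Hedged sketch: learning from environment feedback alone, with no reward signal.
# All functions below are illustrative stubs, not the SDPO API.

def generate(prompt: str) -> str:
    # Stand-in for sampling a completion from the policy.
    return "print(undefined_var)"

def run_in_env(code: str) -> str | None:
    # Execute the attempt; a runtime error is free feedback, no reward required.
    try:
        exec(code, {})
        return None  # attempt ran cleanly
    except Exception as e:
        return repr(e)

def distill_toward(prompt: str, target: str) -> None:
    # Placeholder for a distillation update: imitate `target` given `prompt`.
    print(f"distill: imitate {target!r} given {prompt!r}")

task = "Print the answer to 6 * 7."
attempt = generate(task)
feedback = run_in_env(attempt)
if feedback is not None:
    # No reward was observed, yet the retry (conditioned on the traceback)
    # already makes a useful distillation target for the bare task prompt.
    retry = generate(f"{task}\nPrevious attempt:\n{attempt}\nError:\n{feedback}\nFix it:")
    distill_toward(task, retry)
```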
In other words: Better models → better retrospection in SDPO → better models
(6/n)
SDPO demonstrates that effective reasoning does not have to be verbose!
How? The self-teacher penalizes useless tokens.
(5/n)
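A toy per-token calculation of that effect (the numbers, and the log-prob-gap signal itself, are our illustration of one plausible mechanism, not the paper's exact objective):

```python
# Toy illustration: why a feedback-conditioned self-teacher can penalize filler.
# Log-probs below are made up; the gap is one plausible per-token signal.
tokens          = ["Let", "me", "think", "...", "42"]
teacher_logprob = [-2.5, -2.7, -3.0, -5.0, -0.2]  # model conditioned on feedback + a success
student_logprob = [-0.3, -0.4, -0.5, -0.6, -1.5]  # same model on the bare prompt

for tok, lt, ls in zip(tokens, teacher_logprob, student_logprob):
    advantage = lt - ls  # teacher-vs-student log-prob gap
    print(f"{tok!r:>8}  advantage = {advantage:+.1f}")
# Verbose tokens the self-teacher finds unlikely get negative advantage,
# so the update pushes probability mass away from useless reasoning.
```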
SDPO reaches GRPO's 5h accuracy in just 30min of wall-clock time, and converges to 20 percentage points higher accuracy.
Also, SDPO learns more concise reasoning (see below).
(4/n)
This leads to interesting patterns of advantages 👇
(3/n)
Key insight: Put environment feedback (like runtime errors) and successful attempts in-context, turning the model into its own teacher.
Bonus: Virtually same runtime as GRPO!
(2/n)
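For intuition, a minimal sketch of how such an in-context self-teacher prompt might be assembled (the layout and build_teacher_context are our invention, not necessarily the paper's format):

```python
# Hedged sketch: the teacher is the SAME model, conditioned on richer context.

def build_teacher_context(task: str, attempts: list[tuple[str, str]]) -> str:
    """Concatenate the task with prior attempts and the environment feedback
    they produced (runtime errors, and eventually a successful attempt)."""
    parts = [f"Task:\n{task}"]
    for i, (completion, feedback) in enumerate(attempts, start=1):
        parts += [f"Attempt {i}:\n{completion}", f"Feedback:\n{feedback}"]
    parts.append("Using the feedback above, solve the task:")
    return "\n\n".join(parts)

print(build_teacher_context(
    "Write a function that returns the sum of a list.",
    [("def s(xs): return xs.sum()",
      "AttributeError: 'list' object has no attribute 'sum'")],
))
```

Because the teacher is just the policy on a longer prompt, no second model has to be loaded, which is consistent with the near-GRPO runtime.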
Joint work with the amazing @marbaga.bsky.social, @gmartius.bsky.social, @arkrause.bsky.social
Interestingly, this solution is viable even without any information about pre-training!
If you're interested in test-time training or active learning, come chat with me at our poster session!