Jonas Hübotter
@jonhue.bsky.social
PhD student at ETH Zurich
jonhue.github.io
We’re really excited about self-distillation as a new paradigm for post-training.

Also check out our work applying the same algorithm to offline data: self-distillation.github.io/SDFT

Here the baseline is SFT, not GRPO.
We show: Self-distillation enables continual learning.
SDFT: Self-Distillation Enables Continual Learning
self-distillation.github.io
January 29, 2026 at 7:50 PM
Huge thanks to my amazing co-authors @rikelue.bsky.social, Lejs Behric, @antonbaumann.bsky.social, @marbaga.bsky.social, Daniel Marta, @idoh.bsky.social, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, @arkrause.bsky.social!!

(n/n)
January 29, 2026 at 7:44 PM
One of my favorite experiments in the paper was seeing that SDPO can discover novel solutions to hard binary-reward problems.

SDPO allows learning even before seeing any reward! It does so simply by sequentially fixing "errors" as the model encounters them.

(7/n)
January 29, 2026 at 7:41 PM
The key idea behind SDPO is to leverage a model's ability to learn in-context. We show that the gains of SDPO grow as the base model is scaled up.

In other words: Better models → better retrospection in SDPO → better models

(6/n)
January 29, 2026 at 7:41 PM
RLVR doesn't just lead to poor credit assignment; it also learns inefficient reasoning! The reasoning style RLVR learns is verbose and often circular.

SDPO demonstrates that effective reasoning does not have to be verbose!
How? The self-teacher penalizes useless tokens.

(5/n)
January 29, 2026 at 7:40 PM
One of our results: We train Olmo3-7B-Instruct on a new task.

SDPO reaches the accuracy GRPO achieves after 5 hours in just 30 minutes of wall-clock time, and converges to an accuracy that is 20 percentage points higher.

Also, SDPO learns more concise reasoning (see below).

(4/n)
January 29, 2026 at 7:40 PM
Why does this work? When conditioned on rich feedback, the model retrospectively evaluates its initial attempt. Anything that seems wrong in hindsight is discouraged. Anything that was good is encouraged.

This leads to interesting patterns of advantages 👇

(3/n)
January 29, 2026 at 7:40 PM
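
To make the advantage pattern concrete: a minimal sketch of how such hindsight advantages could be computed, assuming per-token log-probabilities from the policy and from the feedback-conditioned self-teacher are available (an illustration, not the exact objective from the paper).

    from typing import List

    def hindsight_advantages(
        logp_policy: List[float],   # log-prob of each attempt token under the policy
        logp_teacher: List[float],  # log-prob of the same tokens when the model is
                                    # additionally conditioned on the rich feedback
    ) -> List[float]:
        # Tokens the self-teacher finds less likely in hindsight get a negative
        # advantage (discouraged); tokens it still endorses get a positive one.
        assert len(logp_policy) == len(logp_teacher)
        return [lt - lp for lt, lp in zip(logp_teacher, logp_policy)]

    # Toy example: the 2nd token looks wrong in hindsight, the 3rd looks good.
    print(hindsight_advantages([-1.0, -0.5, -2.0], [-1.1, -3.0, -0.4]))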
Introducing Self-Distillation Policy Optimization (SDPO).

Key insight: Putting environment feedback (like runtime errors) and successful attempts in-context turns the model into its own teacher.

Bonus: Virtually same runtime as GRPO!

(2/n)
January 29, 2026 at 7:40 PM
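
A minimal sketch of what putting feedback and successful attempts in-context could look like; the template and field names below are illustrative assumptions, not the exact prompt used in SDPO.

    def build_self_teacher_context(task, attempt, feedback, successful_attempts):
        # Concatenate the original task, the model's own attempt, the environment
        # feedback, and any successful attempts into one teacher-side context.
        parts = [
            f"Task:\n{task}",
            f"Your previous attempt:\n{attempt}",
            f"Environment feedback:\n{feedback}",
        ]
        for i, solution in enumerate(successful_attempts, start=1):
            parts.append(f"Successful attempt {i}:\n{solution}")
        parts.append("Given this feedback, write an improved solution:")
        return "\n\n".join(parts)

    print(build_self_teacher_context(
        task="Implement factorial(n).",
        attempt="def factorial(n): return n * factorial(n)",
        feedback="RecursionError: maximum recursion depth exceeded",
        successful_attempts=[
            "def factorial(n): return 1 if n <= 1 else n * factorial(n - 1)"
        ],
    ))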
We propose an algorithm that does this by actively maximizing expected information gain of the demonstrations, with a couple of tricks to estimate this quantity and mitigate forgetting.
Interestingly, this solution is viable even without any information about pre-training!
July 14, 2025 at 7:35 PM
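
As a rough illustration of the selection principle only (not the paper's estimator or its forgetting-mitigation tricks): one could greedily pick the demonstrations that maximize information gain under a Gaussian-process surrogate over demonstration embeddings.

    import numpy as np

    def rbf_kernel(X, Y, length_scale=1.0):
        # Squared-exponential similarity between two sets of embeddings.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))

    def info_gain(K_S, noise=1e-2):
        # Mutual information between observations at S and the surrogate:
        # 0.5 * log det(I + K_S / noise).
        n = K_S.shape[0]
        return 0.5 * np.linalg.slogdet(np.eye(n) + K_S / noise)[1]

    def select_demonstrations(embeddings, budget):
        # Greedily add the demonstration that yields the largest gain.
        selected = []
        for _ in range(budget):
            best, best_gain = None, -np.inf
            for i in range(len(embeddings)):
                if i in selected:
                    continue
                idx = selected + [i]
                gain = info_gain(rbf_kernel(embeddings[idx], embeddings[idx]))
                if gain > best_gain:
                    best, best_gain = i, gain
            selected.append(best)
        return selected

    embeddings = np.random.default_rng(0).normal(size=(20, 8))
    print(select_demonstrations(embeddings, budget=3))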
Our method significantly improves performance (measured as perplexity) for large language models and achieves a new state-of-the-art on the Pile benchmark.

If you're interested in test-time training or active learning, come chat with me at our poster session!
April 21, 2025 at 2:40 PM
We introduce SIFT, a novel data selection algorithm for test-time training of language models. Unlike traditional nearest neighbor methods, SIFT uses uncertainty estimates to select maximally informative data, balancing relevance & diversity.
April 21, 2025 at 2:40 PM
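
A simplified sketch of this kind of uncertainty-driven selection, assuming a Gaussian-process surrogate with an RBF kernel over embeddings (SIFT's actual criterion and estimates differ in the details): greedily pick the example that most reduces posterior uncertainty at the test prompt, which trades off relevance against redundancy.

    import numpy as np

    def rbf(X, Y, length_scale=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))

    def posterior_variance(x_test, X_S, noise=1e-2):
        # Remaining uncertainty about the test prompt after "observing" X_S.
        if len(X_S) == 0:
            return 1.0  # prior variance under an RBF kernel
        K = rbf(X_S, X_S) + noise * np.eye(len(X_S))
        k = rbf(x_test[None, :], X_S)[0]
        return 1.0 - k @ np.linalg.solve(K, k)

    def select_for_test_prompt(x_test, candidates, budget):
        chosen = []
        for _ in range(budget):
            remaining = [i for i in range(len(candidates)) if i not in chosen]
            # Relevant examples reduce uncertainty; near-duplicates of already
            # chosen examples barely help, which encourages diversity.
            best = min(remaining, key=lambda i: posterior_variance(
                x_test, candidates[chosen + [i]]))
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(1)
    candidates = rng.normal(size=(50, 16))
    x_test = rng.normal(size=16)
    print(select_for_test_prompt(x_test, candidates, budget=5))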
Unfortunately not as of now. We may also release Jupyter notebooks in the future, but this may take some time.
February 12, 2025 at 10:25 PM
I'm glad you find this resource useful Maximilian!
February 11, 2025 at 3:26 PM
Noted. Thanks for the suggestion!
February 11, 2025 at 9:01 AM
Very glad to hear that they’ve been useful to you! :)
February 11, 2025 at 8:37 AM
table of contents:
February 11, 2025 at 8:35 AM
Huge thanks to the countless people that helped in the process of bringing this resource together!
February 11, 2025 at 8:20 AM