Jonas Hübotter
@jonhue.bsky.social
PhD student at ETH Zurich
jonhue.github.io
We’re really excited about self-distillation as a new paradigm for post-training.

Also check out our work applying the same algorithm to offline data: self-distillation.github.io/SDFT

Here the baseline is SFT, not GRPO.
We show: Self-distillation enables continual learning.
SDFT: Self-Distillation Enables Continual Learning
self-distillation.github.io
January 29, 2026 at 7:50 PM
Huge thanks to my amazing co-authors @rikelue.bsky.social, Lejs Behric, @antonbaumann.bsky.social, @marbaga.bsky.social, Daniel Marta, @idoh.bsky.social, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, @arkrause.bsky.social!!

(n/n)
January 29, 2026 at 7:44 PM
One of my favorite experiments in the paper was seeing that SDPO can discover novel solutions to hard binary-reward problems.

SDPO allows learning even before seeing any reward! It does so simply by sequentially fixing "errors" as the model encounters them.

(7/n)
January 29, 2026 at 7:41 PM
The key idea behind SDPO is to leverage a model's ability to learn in-context. We show that the gains of SDPO grow as the base model is scaled up.

In other words: Better models → better retrospection in SDPO → better models

(6/n)
January 29, 2026 at 7:41 PM
RLVR doesn't just lead to poor credit assignment; it also learns inefficient reasoning! The reasoning style RLVR learns is verbose and often circular.

SDPO demonstrates that effective reasoning does not have to be verbose!
How? The self-teacher penalizes useless tokens.

(5/n)
January 29, 2026 at 7:40 PM
One of our results: We train Olmo3-7B-Instruct on a new task.

SDPO reaches the accuracy GRPO achieves after 5 hours in just 30 minutes of wall-clock time, and converges to an accuracy that is 20 percentage points higher.

Also, SDPO learns more concise reasoning (see below).

(4/n)
January 29, 2026 at 7:40 PM
Why does this work? When conditioned on rich feedback, the model retrospectively evaluates its initial attempt. Anything that seems wrong in hindsight is discouraged. Anything that was good is encouraged.

This leads to interesting patterns of advantages 👇

(3/n)
January 29, 2026 at 7:40 PM
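
To make the advantage pattern concrete: a minimal sketch of how such hindsight advantages could be computed, assuming per-token log-probabilities from the policy and from the feedback-conditioned self-teacher are available (an illustration, not the exact objective from the paper).

    from typing import List

    def hindsight_advantages(
        logp_policy: List[float],   # log-prob of each attempt token under the policy
        logp_teacher: List[float],  # log-prob of the same tokens when the model is
                                    # additionally conditioned on the rich feedback
    ) -> List[float]:
        # Tokens the self-teacher finds less likely in hindsight get a negative
        # advantage (discouraged); tokens it still endorses get a positive one.
        assert len(logp_policy) == len(logp_teacher)
        return [lt - lp for lt, lp in zip(logp_teacher, logp_policy)]

    # Toy example: the 2nd token looks wrong in hindsight, the 3rd looks good.
    print(hindsight_advantages([-1.0, -0.5, -2.0], [-1.1, -3.0, -0.4]))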
Introducing Self-Distillation Policy Optimization (SDPO).

Key insight: Putting environment feedback (like runtime errors) and successful attempts in-context turns the model into its own teacher.

Bonus: Virtually same runtime as GRPO!

(2/n)
January 29, 2026 at 7:40 PM
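
A minimal sketch of what putting feedback and successful attempts in-context could look like; the template and field names below are illustrative assumptions, not the exact prompt used in SDPO.

    def build_self_teacher_context(task, attempt, feedback, successful_attempts):
        # Concatenate the original task, the model's own attempt, the environment
        # feedback, and any successful attempts into one teacher-side context.
        parts = [
            f"Task:\n{task}",
            f"Your previous attempt:\n{attempt}",
            f"Environment feedback:\n{feedback}",
        ]
        for i, solution in enumerate(successful_attempts, start=1):
            parts.append(f"Successful attempt {i}:\n{solution}")
        parts.append("Given this feedback, write an improved solution:")
        return "\n\n".join(parts)

    print(build_self_teacher_context(
        task="Implement factorial(n).",
        attempt="def factorial(n): return n * factorial(n)",
        feedback="RecursionError: maximum recursion depth exceeded",
        successful_attempts=[
            "def factorial(n): return 1 if n <= 1 else n * factorial(n - 1)"
        ],
    ))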
We propose an algorithm that does this by actively maximizing expected information gain of the demonstrations, with a couple of tricks to estimate this quantity and mitigate forgetting.
Interestingly, this solution is viable even without any information about pre-training!
July 14, 2025 at 7:35 PM
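
As a rough illustration of the selection principle only (not the paper's estimator or its forgetting-mitigation tricks): one could greedily pick the demonstrations that maximize information gain under a Gaussian-process surrogate over demonstration embeddings.

    import numpy as np

    def rbf_kernel(X, Y, length_scale=1.0):
        # Squared-exponential similarity between two sets of embeddings.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))

    def info_gain(K_S, noise=1e-2):
        # Mutual information between observations at S and the surrogate:
        # 0.5 * log det(I + K_S / noise).
        n = K_S.shape[0]
        return 0.5 * np.linalg.slogdet(np.eye(n) + K_S / noise)[1]

    def select_demonstrations(embeddings, budget):
        # Greedily add the demonstration that yields the largest gain.
        selected = []
        for _ in range(budget):
            best, best_gain = None, -np.inf
            for i in range(len(embeddings)):
                if i in selected:
                    continue
                idx = selected + [i]
                gain = info_gain(rbf_kernel(embeddings[idx], embeddings[idx]))
                if gain > best_gain:
                    best, best_gain = i, gain
            selected.append(best)
        return selected

    embeddings = np.random.default_rng(0).normal(size=(20, 8))
    print(select_demonstrations(embeddings, budget=3))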
Our method significantly improves performance (measured as perplexity) for large language models and achieves a new state-of-the-art on the Pile benchmark.

If you're interested in test-time training or active learning, come chat with me at our poster session!
April 21, 2025 at 2:40 PM
We introduce SIFT, a novel data selection algorithm for test-time training of language models. Unlike traditional nearest neighbor methods, SIFT uses uncertainty estimates to select maximally informative data, balancing relevance & diversity.
April 21, 2025 at 2:40 PM
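
A simplified sketch of this kind of uncertainty-driven selection, assuming a Gaussian-process surrogate with an RBF kernel over embeddings (SIFT's actual criterion and estimates differ in the details): greedily pick the example that most reduces posterior uncertainty at the test prompt, which trades off relevance against redundancy.

    import numpy as np

    def rbf(X, Y, length_scale=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))

    def posterior_variance(x_test, X_S, noise=1e-2):
        # Remaining uncertainty about the test prompt after "observing" X_S.
        if len(X_S) == 0:
            return 1.0  # prior variance under an RBF kernel
        K = rbf(X_S, X_S) + noise * np.eye(len(X_S))
        k = rbf(x_test[None, :], X_S)[0]
        return 1.0 - k @ np.linalg.solve(K, k)

    def select_for_test_prompt(x_test, candidates, budget):
        chosen = []
        for _ in range(budget):
            remaining = [i for i in range(len(candidates)) if i not in chosen]
            # Relevant examples reduce uncertainty; near-duplicates of already
            # chosen examples barely help, which encourages diversity.
            best = min(remaining, key=lambda i: posterior_variance(
                x_test, candidates[chosen + [i]]))
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(1)
    candidates = rng.normal(size=(50, 16))
    x_test = rng.normal(size=16)
    print(select_for_test_prompt(x_test, candidates, budget=5))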
Unfortunately not as of now. We may also release Jupyter notebooks in the future, but this may take some time.
February 12, 2025 at 10:25 PM
I'm glad you find this resource useful Maximilian!
February 11, 2025 at 3:26 PM
Noted. Thanks for the suggestion!
February 11, 2025 at 9:01 AM
Very glad to hear that they’ve been useful to you! :)
February 11, 2025 at 8:37 AM
table of contents:
February 11, 2025 at 8:35 AM
Huge thanks to the countless people that helped in the process of bringing this resource together!
February 11, 2025 at 8:20 AM