Jacob Springer
@jacobspringer.bsky.social
Machine Learning (the science part) | PhD student @ CMU
jacobspringer.bsky.social
For the theorists in the room: we dive deeper into why this happens using a linear transfer learning setup, revealing that incremental learning leads to catastrophic overtraining.
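(For intuition, here is one generic way a linear transfer setup can be written down. This is a sketch of the genre, with my own notation, not necessarily the paper's exact formulation: pre-train a linear predictor incrementally, then apply a fine-tuning update, and track both losses as pre-training time t grows.)

```latex
% Generic linear transfer sketch (assumed notation; not necessarily the paper's setup).
% \theta_t: weights after t steps of incremental pre-training on D_pre.
% \theta_t^{ft}: those weights after a fine-tuning update on D_task.
\begin{align*}
  \theta_{t+1} &= \theta_t - \eta_{\text{pre}}\,\nabla_\theta \mathcal{L}_{\text{pre}}(\theta_t),
  &\quad \mathcal{L}_{\text{pre}}(\theta) &= \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{pre}}}\big[(y - \theta^\top x)^2\big], \\
  \theta_t^{\text{ft}} &= \theta_t - \eta_{\text{ft}}\,\nabla_\theta \mathcal{L}_{\text{task}}(\theta_t),
  &\quad \mathcal{L}_{\text{task}}(\theta) &= \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{task}}}\big[(y - \theta^\top x)^2\big].
\end{align*}
% Catastrophic overtraining in this picture: past some t, both
% L_task(\theta_t^{ft}) and L_pre(\theta_t^{ft}) get worse as t keeps growing.
```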

9/10
jacobspringer.bsky.social
Fine-tuning behaves similarly: when we fine-tune different pre-training checkpoints with a fixed learning rate, we see eventual degradation in both task performance and web-data perplexity. This often holds even after hyperparameter tuning. Overtraining = worse fine-tuning outcomes!
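(A rough sketch of that checkpoint-sweep protocol in Python; the helper functions and names here are hypothetical placeholders, not the paper's code.)

```python
# Hypothetical sketch: fine-tune every pre-training checkpoint with the same
# fixed hyperparameters, then measure downstream performance and perplexity
# on held-out web data.
def sweep_checkpoints(checkpoints, finetune, eval_task, eval_web_ppl,
                      lr=1e-5, epochs=2):
    """checkpoints: list of (tokens_seen, base_model), ordered by tokens_seen."""
    results = []
    for tokens_seen, base_model in checkpoints:
        # Same fixed learning rate for every checkpoint.
        tuned = finetune(base_model, lr=lr, epochs=epochs)
        results.append({
            "pretrain_tokens": tokens_seen,
            "task_score": eval_task(tuned),        # e.g. instruction-following eval
            "web_perplexity": eval_web_ppl(tuned),  # perplexity on held-out web data
        })
    # Catastrophic overtraining: beyond some token budget, task_score drops and
    # web_perplexity rises even though pretrain_tokens keeps growing.
    return results
```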

8/10
jacobspringer.bsky.social
👉 Early in training: Models have low sensitivity & the base model improves quickly; performance improves 📈
👉 Late in training: Models become highly sensitive & the base model improves slowly; performance degrades! 📉

7/10
jacobspringer.bsky.social
What's happening? Beyond Gaussian perturbations, extended pre-training increases model sensitivity to all types of parameter updates 👇
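(One simple way to make "sensitivity" concrete, written generically; the paper's exact measure may differ.)

```latex
% Sensitivity of a checkpoint \theta_t to a parameter update \delta:
% the loss increase caused by applying that update.
\[
  S(\theta_t, \delta) \;=\; \mathcal{L}(\theta_t + \delta) - \mathcal{L}(\theta_t),
  \qquad \delta \sim \mathcal{N}(0, \sigma^2 I) \;\text{ (Gaussian noise), or a fine-tuning update.}
\]
```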

6/10
jacobspringer.bsky.social
🔹 Early checkpoints: Robust to parameter changes.
🔸 Later checkpoints: Highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)

5/10
jacobspringer.bsky.social
Let’s step back and consider a simpler setting: we train our own 30M-parameter models and test how adding Gaussian noise to the parameters affects performance at different pre-training stages 👇
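(A minimal sketch of this perturbation probe, assuming a PyTorch model; `eval_loss` and the checkpoint loading are hypothetical placeholders, not the paper's code.)

```python
import torch

@torch.no_grad()
def perturb_and_eval(model, eval_loss, sigma=0.01):
    """Add i.i.d. Gaussian noise with std `sigma` to every parameter,
    evaluate the loss, then restore the original weights."""
    originals = [p.detach().clone() for p in model.parameters()]
    for p in model.parameters():
        p.add_(torch.randn_like(p) * sigma)
    loss_after_noise = eval_loss(model)
    for p, orig in zip(model.parameters(), originals):
        p.copy_(orig)
    return loss_after_noise

# Repeating this for checkpoints taken at different points in pre-training
# traces how sensitivity to the same noise level changes over training.
```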

4/10
jacobspringer.bsky.social
Example: OLMo-1B trained on 3T tokens performs over 2% *worse* after instruction tuning than its 2.3T-token checkpoint, even though it saw 30% more pre-training data! We observe similar degradation across many other post-training setups.

Why does extended pre-training hurt fine-tuning performance? 🤔

3/10
jacobspringer.bsky.social
The latest language models are pre-trained on more and more tokens while holding the number of model parameters fixed—and this trend isn't slowing down!
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let’s check!

2/10
jacobspringer.bsky.social
Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!
Introducing "catastrophic overtraining." 🥁🧵👇

arxiv.org/abs/2503.19206

1/10