Dan Busbridge
@dbusbridge.bsky.social
Machine Learning Research @ Apple (opinions are my own)
Pinned
dbusbridge.bsky.social
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:

"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"

Our distillation scaling law shows, well, it's complicated... 🧵

arxiv.org/abs/2502.08606
Distillation Scaling Laws
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated ...
arxiv.org
dbusbridge.bsky.social
We do observe that students improve with longer distillation training (i.e., patient teaching works). Additionally, with particularly long distillation durations, we approach what supervised learning can achieve (as limited by model capacity, in our experimental setting).
dbusbridge.bsky.social
In contrast, our teachers and students are trained on the same data distribution, and we compare with supervised models that can access the same distribution. This lets us make statements about what distillation can do given access to the same resources.
dbusbridge.bsky.social
I.e., Beyer et al.'s students do not see the teacher training distribution, and there is no supervised baseline where, for example, a model has access to both INet21k and the smaller datasets. This type of comparison was not the focus of their work.
dbusbridge.bsky.social
Beyer et al.'s teachers are trained on a large, diverse dataset (e.g., INet21k) then fine-tuned for the target datasets (e.g., Flowers102 or ImageNet1k). Students are distilled on the target datasets and only access the teacher's training distribution indirectly.
dbusbridge.bsky.social
These differences, outlined in Table 4, relate primarily to the relationship between student and teacher.
dbusbridge.bsky.social
As is often the case with distillation, the situation is nuanced, which is why we provide a dedicated discussion in Appendix D.1.

The key message is that despite an apparent contradiction, our findings are consistent because our baselines and experimental settings differ.
dbusbridge.bsky.social
Several people have asked me to comment further on the connection between our work and the Patient and Consistent Teachers study by Beyer et al., since in Section 5.1 we note that our findings appear to contradict theirs.

arxiv.org/abs/2106.05237
dbusbridge.bsky.social
This investigation answered many long-standing questions I’ve had about distillation. I am deeply grateful for my amazing collaborators Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb, whose expertise and support brought this project to life.
dbusbridge.bsky.social
So, distillation does work.

If you have access to a decent teacher, distillation is more efficient than supervised learning, though the efficiency gains shrink as compute increases.

If you also need to train the teacher, it may not be worth it, depending on whether the teacher has other uses.
dbusbridge.bsky.social
5. Using fixed-aspect-ratio (width/depth) models enables treating N as the non-embedding parameter count and allows a simple, accurate expression for FLOPs/token that works for large contexts. I think this would be useful for the community to adopt more generally.
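For intuition, a common training-FLOPs/token approximation in the spirit of Kaplan et al. (2020) looks like the sketch below; the exact expression in the paper may differ, so treat this as illustrative only.

```python
# Kaplan-style training-FLOPs/token approximation (illustrative; not necessarily the
# exact expression used in the paper).

def train_flops_per_token(n_nonembed: float, n_layer: int, d_model: int, ctx: int) -> float:
    """~2N forward FLOPs plus ~2*n_layer*ctx*d_model attention-score FLOPs,
    times 3 to account for the backward pass."""
    forward = 2 * n_nonembed + 2 * n_layer * ctx * d_model
    return 3 * forward

# With a fixed width/depth aspect ratio, n_layer and d_model are both determined by N,
# so FLOPs/token collapses to a function of (N, ctx) alone.
```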
dbusbridge.bsky.social
4. We present the first synthetic demonstrations of the distillation capacity gap phenomenon, using kernel regression (left) and pointer mapping problems (right). This will help us better understand why making a teacher too strong can harm student performance.
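A toy probe in this spirit (not the paper's exact construction, and the kernel widths here are made up) might distill teachers of increasing capacity into a fixed low-capacity kernel-regression student and check whether the strongest teacher gives the best student:

```python
# Hypothetical kernel-regression distillation probe for the capacity gap.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(256, 1))
x_test = rng.uniform(-3, 3, size=(1024, 1))

def target(x):
    # A wiggly 1-D target with fine structure a low-capacity student cannot fully fit.
    return np.sin(3 * x).ravel() + 0.3 * np.sign(np.sin(9 * x)).ravel()

y_train = target(x_train) + 0.1 * rng.normal(size=256)

student_config = dict(kernel="rbf", gamma=0.5, alpha=1e-1)   # fixed low-capacity student

for teacher_gamma in (0.5, 2.0, 8.0, 32.0):                  # increasing teacher capacity
    teacher = KernelRidge(kernel="rbf", gamma=teacher_gamma, alpha=1e-3).fit(x_train, y_train)
    student = KernelRidge(**student_config).fit(x_train, teacher.predict(x_train))  # distill
    mse = np.mean((student.predict(x_test) - target(x_test)) ** 2)
    print(f"teacher gamma={teacher_gamma:5.1f} -> student test MSE {mse:.3f}")
```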
dbusbridge.bsky.social
3. Training on the teacher logit signal produces a calibrated student (left), whereas training on the teacher top-1 does not (right). This can be understood using proper scoring rules, but it is nice to observe it clearly.
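One simple way to check the calibration claim empirically is an expected-calibration-error computation like the sketch below (the equal-width binning here is illustrative, not the evaluation from the paper):

```python
# Expected calibration error (ECE) over next-token predictions; illustrative only.
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """probs: [n, vocab] predicted distributions; labels: [n] true token ids."""
    confidences = probs.max(axis=-1)
    predictions = probs.argmax(axis=-1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| in the bin, weighted by the bin's share of samples.
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece
```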
dbusbridge.bsky.social
2. Maximal Update Parametrization (µP, arxiv.org/abs/2011.14522) works out of the box for distillation. The distillation-optimal learning rate is the same as the base learning rate used for teacher training. This simplified our setup a lot.
dbusbridge.bsky.social
We also investigated many other aspects of distillation that should be useful for the community.

1. Language model distillation should use temperature = 1. Dark knowledge is less important, as the target distribution of natural language is already complex and multimodal.
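For concreteness, a minimal token-level distillation loss with the temperature exposed might look like this (names and shapes are illustrative, not the paper's training code); at temperature = 1 the teacher distribution is used unchanged:

```python
# Forward KL distillation loss; at temperature = 1 no extra smoothing is applied.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), averaged per token.

    student_logits, teacher_logits: [batch, seq_len, vocab_size].
    """
    t = temperature
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # kl_div expects log-probabilities for the input and probabilities for the target.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * t * t  # the usual T^2 rescaling; a no-op when T = 1
```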
dbusbridge.bsky.social
When using an existing teacher, distillation overhead comes from generating student targets. Compute-optimal distillation shows it's better to choose a smaller teacher, slightly more capable than the target student, rather than a large, powerful one.
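A back-of-the-envelope version of that accounting, using the common ~6ND training and ~2ND forward-pass FLOP approximations (not necessarily the paper's exact accounting):

```python
# Rough compute accounting for distilling from an existing teacher: student training
# plus the teacher forward passes needed to generate the distillation targets.

def distillation_flops(n_student: float, n_teacher: float, tokens: float) -> float:
    return 6 * n_student * tokens + 2 * n_teacher * tokens

def supervised_flops(n_student: float, tokens: float) -> float:
    return 6 * n_student * tokens

# Hypothetical example: a 1B student distilled on 512B tokens from a 3B vs a 30B teacher.
for n_teacher in (3e9, 30e9):
    overhead = distillation_flops(1e9, n_teacher, 512e9) / supervised_flops(1e9, 512e9)
    print(f"{n_teacher/1e9:.0f}B teacher -> {overhead:.1f}x the supervised compute")
```

Under this approximation the larger teacher makes target generation dominate the budget, which is why a smaller teacher just above the target student capability is preferable.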
dbusbridge.bsky.social
The distillation scaling law extrapolates with ~1% error in predicted student cross-entropy. We use our law in an extension of Chinchilla, giving compute-optimal distillation for settings of interest to the community and informing our conclusions about when distillation is beneficial.
dbusbridge.bsky.social
To arrive at our distillation scaling law, we noticed that teacher size influences student cross-entropy only through the teacher cross-entropy, leading to a scaling law in student size, distillation tokens, and teacher cross-entropy. Teacher size isn't directly important.
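To make that structure concrete, here is an illustrative parametric family (not the paper's exact functional form) in which the teacher enters only through its cross-entropy, alongside Chinchilla-style terms in student size and distillation tokens; in practice the coefficients would be fit to measured runs:

```python
# Hypothetical scaling-law family: the teacher appears only via its cross-entropy l_t.

def student_cross_entropy(n_s, d_s, l_t, e, a, alpha, b, beta, c):
    """n_s: student non-embedding parameters, d_s: distillation tokens,
    l_t: teacher cross-entropy. All coefficients are placeholders to be fit."""
    return c * l_t + e + a / n_s**alpha + b / d_s**beta

coeffs = dict(e=1.0, a=4.2e2, alpha=0.34, b=1.1e3, beta=0.28, c=0.3)  # made-up values

for n_s, d_s, l_t in [(143e6, 2e9, 2.2), (1.8e9, 64e9, 2.2), (1.8e9, 64e9, 1.9)]:
    print(n_s, d_s, l_t, round(student_cross_entropy(n_s, d_s, l_t, **coeffs), 3))
```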
dbusbridge.bsky.social
These results come from the largest study of distillation we are aware of: hundreds of student/teacher/data combinations, with model sizes ranging from 143M to 12.6B parameters and distillation data ranging from 2B to 512B tokens.
dbusbridge.bsky.social
Compared to supervised training:
1. It is only more efficient to distill if the teacher training cost does not matter.
2. Efficiency benefits vanish when enough compute/data is available.
3. Distillation cannot produce lower cross-entropies when enough compute/data is available.