Chunks of tokens with different tokenization biases are not fairly comparable!⚠️⚠️
We therefore develop a method that finds chunks with a low difference in tokenization bias (making them *approximately comparable*), then learns to match their likelihoods ✅
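To make "comparable chunks" concrete, here is a minimal sketch of one naive proxy: cut the text only at character offsets where *both* tokenizers end a token, so every chunk is a whole number of tokens under either tokenizer. This uses Hugging Face fast tokenizers and their `return_offsets_mapping` feature; it is only an illustration, not the low-tokenization-bias-difference criterion the method actually uses to decide which chunks are comparable.

```python
from typing import List, Tuple
from transformers import AutoTokenizer  # fast tokenizers assumed

def shared_chunk_spans(text: str, tok_a, tok_b) -> List[Tuple[int, int]]:
    """Cut `text` at character offsets where BOTH tokenizers end a token,
    so each chunk covers a whole number of tokens under either tokenizer."""
    def boundaries(tok):
        offsets = tok(text, return_offsets_mapping=True,
                      add_special_tokens=False)["offset_mapping"]
        return {end for _, end in offsets}
    cuts = sorted((boundaries(tok_a) & boundaries(tok_b)) | {0, len(text)})
    return list(zip(cuts[:-1], cuts[1:]))

# Usage (tokenizer names are placeholders):
# tok_a = AutoTokenizer.from_pretrained("gpt2")
# tok_b = AutoTokenizer.from_pretrained("bert-base-uncased")
# text = "Tokenization boundaries rarely line up."
# chunks = [text[s:e] for s, e in shared_chunk_spans(text, tok_a, tok_b)]
```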
Most distillation methods to date have required the teacher and the student to share the same tokenizer.
We lift this restriction by first identifying comparable chunks of tokens in a sequence (surprisingly, this is not so easy!), then minimizing the difference between their likelihoods.
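As a rough sketch of the second step: assuming each approximately comparable chunk's log-likelihood is the sum of the per-token log-probs of the tokens that make it up under each model's own tokenization, one could minimize a simple squared gap between student and teacher chunk log-likelihoods. The helper names and the exact loss below are illustrative, not the paper's definition.

```python
import torch

def chunk_loglik(token_logprobs: torch.Tensor, chunk_token_slices) -> torch.Tensor:
    """Sum per-token log-probs over the tokens a model assigns to each chunk.
    `chunk_token_slices` maps each chunk to a (start, end) token-index range
    under that model's own tokenization."""
    return torch.stack([token_logprobs[s:e].sum() for s, e in chunk_token_slices])

def chunk_matching_loss(student_chunk_ll: torch.Tensor,
                        teacher_chunk_ll: torch.Tensor) -> torch.Tensor:
    """Penalize the gap between student and teacher chunk log-likelihoods.
    The teacher side is detached because only the student is trained."""
    return ((student_chunk_ll - teacher_chunk_ll.detach()) ** 2).mean()
```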