Furthermore, we find that the token-level granularity in self-distillation is both a strength and a weakness – it's dense, but there is often a great deal of noise in the updates because the teacher and student may disagree on tokens for reasons unrelated to the desired behavior. We want to concentrate our loss on the tokens which matter most for the student's improvement.
This is the intuition behind RMSD: we filter for a set of T token positions where the teacher and student disagree the most, then an LLM judge narrows those to a final set of S tokens that are most relevant to the target behavior.
In our experiments, RMSD reaches the target behavior in about half the steps of OPSD, making it significantly more data efficient, while also training more stably and reaching a higher performance ceiling.
We think that the self-distillation approach and RMSD in particular will be powerful methods for advancing the capabilities of models on specialized settings.