I think there's some confusion about what the on-policy distillation (OPD) loss actually optimizes. So here's some math:
If you write down the *sequence-level* reverse KL between a student and teacher model, the gradient would have two terms (see the image below).
- MiniLLM backpropagates through the student's sampling distribution, which makes it akin to policy gradient with dense rewards (but noisier).
- OPD does not add this term, which makes it akin to supervised learning: simpler to implement and similar to what DAgger does.
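For readers without the image, here is one way to write the decomposition. The notation is my own and may not match the image exactly: $\pi_\theta$ is the student, $p$ is the teacher, $y$ is a sampled response, and conditioning on the prompt is omitted.

```latex
\mathcal{L}(\theta)
  = \mathrm{KL}\big(\pi_\theta \,\|\, p\big)
  = \mathbb{E}_{y \sim \pi_\theta}\Big[\sum_{t=1}^{T} \mathrm{KL}_t(y_{<t})\Big],
\qquad
\mathrm{KL}_t(y_{<t})
  = \sum_{v} \pi_\theta(v \mid y_{<t}) \log \frac{\pi_\theta(v \mid y_{<t})}{p(v \mid y_{<t})}

\nabla_\theta \mathcal{L}
  = \underbrace{\mathbb{E}_{y \sim \pi_\theta}\Big[\sum_{t=1}^{T} \nabla_\theta \mathrm{KL}_t(y_{<t})\Big]}_{\text{per-token term, prefixes held fixed}}
  + \underbrace{\mathbb{E}_{y \sim \pi_\theta}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \sum_{t' > t} \mathrm{KL}_{t'}(y_{<t'})\Big]}_{\text{term from differentiating the sampling distribution}}
```

The first term treats the sampled prefixes as fixed data, which is the supervised-looking part. The second term is a policy-gradient-style term in which the future per-token KLs act as dense costs.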
As such, almost all of the OPD codebases (e.g., Tinker, verl, Hugging Face) use the implementation suggested by the GKD paper (the one on the left).
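To make the two gradient terms concrete, here is a minimal toy check, not taken from any of the codebases above. It assumes a made-up setup: three-word vocabulary, two-token sequences, and tabular softmax logits for the student (`theta0` for the first token, `theta1[a]` for the second token given first token `a`). All names and numbers are hypothetical. It computes the per-token term and the through-sampling term for the first-token logits in closed form and compares their sum against a finite-difference gradient of the exact sequence-level reverse KL.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(pi, p):
    """Full-vocabulary reverse KL(pi || p) at a single position."""
    return sum(a * (math.log(a) - math.log(b)) for a, b in zip(pi, p))

def seq_reverse_kl(theta0, theta1, p0, p1):
    """Exact sequence-level KL(student || teacher) over two-token sequences."""
    pi0 = softmax(theta0)
    kl1 = [kl(softmax(theta1[a]), p1[a]) for a in range(len(pi0))]
    return kl(pi0, p0) + sum(w * k for w, k in zip(pi0, kl1))

def first_token_gradient_terms(theta0, theta1, p0, p1):
    """Both gradient terms with respect to the first-token logits theta0."""
    pi0 = softmax(theta0)
    kl0 = kl(pi0, p0)
    kl1 = [kl(softmax(theta1[a]), p1[a]) for a in range(len(pi0))]
    mean_kl1 = sum(w * k for w, k in zip(pi0, kl1))
    # Per-token term: d KL(softmax(z) || p) / d z_k = pi_k * (log(pi_k / p_k) - KL).
    per_token = [a * (math.log(a) - math.log(b) - kl0) for a, b in zip(pi0, p0)]
    # Through-sampling term: E_a[ d log pi0(a) / d z_k * (future KL after a) ],
    # which for softmax logits simplifies to pi0_k * (kl1_k - E[kl1]).
    through_sampling = [w * (k - mean_kl1) for w, k in zip(pi0, kl1)]
    return per_token, through_sampling

def finite_diff_first_token(theta0, theta1, p0, p1, eps=1e-6):
    """Central finite-difference gradient of the exact loss w.r.t. theta0."""
    grads = []
    for k in range(len(theta0)):
        up, down = list(theta0), list(theta0)
        up[k] += eps
        down[k] -= eps
        diff = seq_reverse_kl(up, theta1, p0, p1) - seq_reverse_kl(down, theta1, p0, p1)
        grads.append(diff / (2 * eps))
    return grads

# Hypothetical student logits and teacher distributions.
theta0 = [0.2, -0.5, 1.0]
theta1 = [[0.1, 0.3, -0.2], [1.0, -1.0, 0.0], [0.0, 0.5, 0.5]]
p0 = softmax([0.0, 0.4, -0.3])
p1 = [softmax(row) for row in [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0], [-0.5, 0.2, 0.1]]]

per_token, through_sampling = first_token_gradient_terms(theta0, theta1, p0, p1)
full_gradient = finite_diff_first_token(theta0, theta1, p0, p1)

print("per-token term:       ", per_token)
print("through-sampling term:", through_sampling)
print("sum of both terms:    ", [a + b for a, b in zip(per_token, through_sampling)])
print("finite-difference:    ", full_gradient)
```

In this toy the second term is computed exactly by summing over all first tokens. In a real training loop it would have to be estimated from sampled rollouts, which is where the extra noise comes from. Dropping it leaves only the per-token term evaluated on student samples.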
We did try what MiniLLM suggested but just couldn't get it to work in practice -- overall, my impression back then was that it adds a bunch of complexity and needs stabilization tricks, without clear gains. That said, it's an interesting direction to study: why doesn't treating distillation as an RL problem work well or bring benefits? (Maybe it does in really long-horizon tasks?)
PS: MiniLLM changed their paper title (!) after the Thinking Machines blog post LOL, to catch the on-policy distillation cites ;)
PPS: GKD is a terrible name