Just recalled something from my work on NextCoder last year. We found that a moving base was sub-optimal compared to a fixed base. (That led to the reasoning that the fixed base acts like a constrained KL-div between the base and the ckpt, which is where the benefits come from, kind of an artifact parallel to RL.)
But, inspired by Dino-V1: what if we use a moving base, but instead of a big ckpt jump we take an exponential moving average of the weights and do the SeleKT updates against that?
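Rough sketch of the idea, to make it concrete. This assumes the SeleKT step boils down to "keep the top-k largest deltas from the base, reset everything else to the base", which is a simplification and not the actual NextCoder code. The names `ema_update`, `selekt_project`, `decay` and `keep_frac` are made up for illustration, and flat lists of floats stand in for model params.

```python
from typing import List


def ema_update(base: List[float], weights: List[float], decay: float) -> List[float]:
    """Move the base slowly toward the current weights (EMA, no big ckpt jump)."""
    return [decay * b + (1.0 - decay) * w for b, w in zip(base, weights)]


def selekt_project(base: List[float], weights: List[float], keep_frac: float) -> List[float]:
    """Assumed SeleKT-style step: keep only the largest-magnitude deltas from the base."""
    deltas = [w - b for w, b in zip(weights, base)]
    k = max(1, int(round(keep_frac * len(deltas))))
    # Indices of the k deltas with the largest absolute value.
    order = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    keep = set(order[:k])
    # Kept positions take the fine-tuned value; the rest fall back to the base.
    return [b + d if i in keep else b for i, (b, d) in enumerate(zip(base, deltas))]


# One periodic update: project against the EMA base, then nudge the base.
base = [0.0, 0.0, 0.0, 0.0]
weights = [1.0, -3.0, 0.5, 2.0]  # pretend these came out of some training steps
weights = selekt_project(base, weights, keep_frac=0.5)
base = ema_update(base, weights, decay=0.5)
```

With `decay=1.0` this is just the fixed base, and with `decay=0.0` it is the full ckpt jump, so `decay` would be the knob to sweep between the two.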
I don't have the compute or the time, so if anyone wants to take this, feel free to.