[New paper] If you are sampling multiple outputs from a teacher LLM (e.g., Gemini 1.5, GPT), ranking them, and fine-tuning the student on only the best output, you can do better.
Simple idea: fine-tune / distill on the top-k ranked outputs instead of only the top-1. Consistent gains on machine translation.
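
A minimal sketch of the data-selection step, not the paper's actual code. The function name `top_k_pairs`, the `score` callable (standing in for whatever ranker or quality metric you use), and the default `k` are all hypothetical, chosen only for illustration:

```python
from typing import Callable, List, Tuple


def top_k_pairs(
    source: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # hypothetical ranker: higher is better
    k: int = 1,
) -> List[Tuple[str, str]]:
    """Return (source, output) training pairs for the k best-scored candidates."""
    # Rank the teacher's sampled outputs from best to worst.
    ranked = sorted(candidates, key=lambda c: score(source, c), reverse=True)
    # k=1 reproduces "fine-tune on the best output"; k>1 keeps the top-k.
    return [(source, c) for c in ranked[:k]]
```

With `k=1` this is the usual best-of-n setup; raising `k` simply adds more (source, output) pairs per input to the student's fine-tuning set.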