1/ Let me chip in on the recent “which optimizer rules them all” discussion with a somewhat more moderate take, asking:
Which Schatten-p norm should we use?
Turns out the answer is regime-dependent! Specifically, even when the objective is smooth with respect to the Schatten-∞ norm, Muon is not necessarily the best choice.