Almost forgot the obligatory self-promotion...
I'll be presenting this work this afternoon at ICLR, poster #148. Stop by to gain a new understanding of NN optimization!
What causes sharpening and Edge of Stability?
Why is Adam > SGD?
How does BatchNorm help?
Why is 1-SAM > SAM > SGD for generalization?
What is Simplicity Bias, really?
Our new work doesn’t answer these questions (well, 𝘮𝘢𝘺𝘣𝘦 the first one)
But it suggests a common cause...