Love papers like RootFree (RF).
It's well-motivated, well-written, has decent ablations, and most importantly, is a simple yet elegant method.
The key idea of RFAdamW seems to be that it moves away from the notion that "2nd-order statistics scale the per-parameter learning rate" and instead embraces the curvature information they carry.
This way, they close the generalization gap between SGD and AdamW without compromising early convergence.
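To make the difference concrete, here's a minimal NumPy sketch of a root-based vs. root-free step. It assumes a bare-bones Adam without bias correction or weight decay; `adam_like_step` and its parameters are hypothetical names for illustration, and the paper's actual update may include details not shown here.

```python
import numpy as np

def adam_like_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, root_free=False):
    """One simplified Adam-style step (no bias correction, no weight decay)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # The only difference: root-based divides by sqrt(v), root-free by v itself.
    denom = v + eps if root_free else np.sqrt(v) + eps
    return param - lr * m / denom, m, v
```

Note that without the root the step is no longer invariant to gradient scale, so hyperparameters tuned for the root-based variant generally won't carry over as-is.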
As the implementation is a one-line change, I've added RF and grafted RF to github.com/ClashLuke/schedul….
Grafted RF (following openreview.net/forum?id=FpKg…) allows the direct transfer of tuned SFAdamW hyperparameters to see immediate gains without retuning.
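Roughly, grafting takes the step size from one optimizer and the step direction from another. A minimal sketch, assuming per-tensor L2-norm grafting; `graft` is a hypothetical helper, not the linked repo's API, and whether that implementation grafts per tensor is an assumption here.

```python
import numpy as np

def graft(magnitude_update, direction_update, eps=1e-12):
    """Rescale direction_update so its L2 norm matches magnitude_update's."""
    mag_norm = np.linalg.norm(magnitude_update)
    dir_norm = np.linalg.norm(direction_update)
    # eps only guards against division by zero for an all-zero direction.
    return direction_update * (mag_norm / (dir_norm + eps))
```

In this setting the root-based update would supply `magnitude_update` and the root-free update `direction_update`, which is why the existing learning rate keeps its meaning.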
See below for an incomplete ranking on a toy problem:
x.com/LinYorker/status/18130…
#ICML2024
Can We Remove the Square-Root in Adaptive Methods?
arxiv.org/abs/2402.03496
Root-free (RF) methods are better on CNNs and competitive on Transformers compared to root-based methods (AdamW)
Removing the root makes matrix methods faster: Root-free Shampoo in BFloat16 /1