👀Got some requests to post this great paper review in English - fulfilling.
🤯Have We Finally Found Adam's Successor?
👽There are very few things in deep learning that have remained largely unquestioned for an entire decade. Adam is one of them.
Since 2014, countless papers have tried to replace it. Some promised faster convergence, others lower memory usage, and some showed impressive results on specific models. Yet when researchers train real systems, most still return to Adam or AdamW.
The reason is simple: outperforming Adam on one metric is easy. Matching it across all the metrics that matter is much harder.
This is where Gefen, a new optimizer by
@nadavbenedek (Reichman University), Tomer Koren (Tel Aviv University and Google), and Ohad Fried (Reichman University), enters the picture.
The goal sounds almost impossible: keep Adam's performance, hyperparameters, and ease of use, while dramatically reducing its memory footprint.
According to the paper, they succeed.
Gefen reduces optimizer memory by 8x compared to AdamW while maintaining comparable performance across language and vision models, including GPT-2, Llama 3, and ResNet. In large-scale training, where optimizer states consume a substantial fraction of GPU memory, this can be transformative. Freed memory can be used for larger models or larger batches. In one experiment, the increased memory budget doubled the microbatch size per GPU and improved training throughput by 56%.
How does it work?
Adam stores two statistics for every parameter: a first moment (a running average of gradients) and a second moment (a running average of squared gradients). These statistics drive Adam's success, but they are also responsible for most of its memory consumption.
The paper's key theoretical result shows that when two parameters are strongly coupled through the Hessian, the ratio of their squared gradients tends toward one. In practical terms, their second-moment statistics become very similar. If two parameters repeatedly produce nearly identical squared gradients, storing separate second-moment estimates becomes wasteful.
Gefen therefore groups parameters and lets them share second-moment statistics. Instead of storing billions of independent values, a single value can represent an entire block of parameters.
Computing Hessians directly would be infeasible, so the authors infer these groups from the squared-gradient patterns observed during the first training step. The process is automatic and requires no architectural knowledge or manual tuning.
To reduce memory further, Gefen compresses the first moment using 8-bit quantization. Unlike most quantized optimizers, which rely on handcrafted codebooks, Gefen computes an optimal codebook using dynamic programming on a histogram of gradient values.
The result is a true drop-in replacement for AdamW: no new hyperparameters, no learning-rate retuning, and substantially lower memory consumption.
Will it replace Adam? It is too early to say. But it is one of the strongest attempts in years to preserve everything practitioners like about Adam while returning most of the memory it consumes.
Paper:
arxiv.org/pdf/2606.13894
Code:
github.com/ndvbd/Gefen