Thank you for this result; I appreciate it. Here's one initial correction:
In his post, Rohan states that I attempted to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.
But this is incorrect: I did not implement my own version of Shampoo.
Instead, as can be seen in the reproducible log I provided in the original post, I used the official DistributedShampoo implementation from Facebook Research, out of the box.
As it happens, @_arohan_ has contributor credits on that implementation.
If there are indeed bugs in this official implementation, I can safely say that they were not created by me.
The remaining mystery, for anyone interested, is then: how did Rohan today achieve a significantly better result than the official 2022-era DistributedShampoo implementation?
I am grateful for the comments he has already made about the differences between his version and the 2022 one, and I look forward to the full, precise details once he releases the reproducible log files from his runs.