This paper is pretty cool; through careful tuning, they show:
- you can train LLMs with batch-size as small as 1, just need smaller lr.
- even plain SGD works at small batch.
- fancy optims mainly help at larger batch. (This reconciles the discrepancy with past ResNet research.)
- at small batch, training is very insensitive to optim hparams!
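To make "batch size 1 with plain SGD" concrete, here is a toy sketch of my own, not the paper's setup: it fits a noise-free line with one example per update, no momentum and no adaptive scaling. The function name `sgd_batch_size_one`, the data, and the learning rate are all made up for illustration.

```python
import random


def sgd_batch_size_one(data, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w * x + b with plain SGD, using one example per update."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:  # each update sees exactly one example
            err = (w * x + b) - y
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b


# Hypothetical toy data on the line y = 3x + 1, with x in [-1, 1].
points = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_batch_size_one(points)
print(w, b)  # should land close to the true w = 3, b = 1
```

This says nothing about LLMs, of course. It only shows the shape of the update loop the bullets describe.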
I find this cool for two reasons:
1) When we did ScalingViT I also surprisingly found (but never published) that pure SGD works much better than expected. However, a small gap always remained, so we dropped it in favour of (our variant of) AdaFactor. The results here confirm this.
2) This is really good news for fine-tuning on small data with few GPUs. Drop the LoRA and do full fine-tuning with tiny batch-size and plain SGD!
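One plausible reason this helps on few GPUs, which is my own reading rather than something spelled out above: plain SGD keeps no per-parameter optimizer state, while Adam keeps two extra values per parameter (first and second moment). A back-of-the-envelope sketch, assuming fp32 state and ignoring weights, gradients, and activations; the helper `optimizer_state_bytes` and the slot table are hypothetical names:

```python
BYTES_FP32 = 4

# Extra per-parameter state values kept by each optimizer.
# Plain SGD: none. SGD with momentum: one buffer. Adam: two moments.
OPTIMIZER_STATE_SLOTS = {"sgd": 0, "sgd_momentum": 1, "adam": 2}


def optimizer_state_bytes(num_params, optimizer, bytes_per_value=BYTES_FP32):
    """Rough optimizer-state footprint in bytes (assumes dense fp32 state)."""
    return num_params * OPTIMIZER_STATE_SLOTS[optimizer] * bytes_per_value


num_params = 1_300_000_000  # the 1.3B scale mentioned in the thread
for name in OPTIMIZER_STATE_SLOTS:
    gib = optimizer_state_bytes(num_params, name) / 2**30
    print(f"{name:>13}: {gib:.1f} GiB of optimizer state")
```

Real frameworks may shard, quantize, or factor this state, so treat the numbers as an upper-level estimate only.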
A word of caution, because:
A) This is mostly done at tiny scale (30M params), to allow running many experiments. It is unclear how well the results hold at larger scale. They do show a 1.3B result, but it's usually beyond 7B that things start to get more difficult.
B) This was all with transformers using QK-Norm, which has a very stabilizing effect. I'd be curious whether it holds without it, and I give it a chance that it might.
C) For large-scale training on many (>10k) chips, large batch size is a necessity, not a choice. And they do show that it's at large batch (not even that large: 4k) that fancy optimizers matter significantly.