My take on "LoRA Learns Less and Forgets Less"
1) "MLP/All" did not include gate_proj. QKVO, up & down trained but not gate (pg 3 footnote)
2) Why does LoRA perform well on math and not code? lm_head & embed_tokens weren't trained, so domain shifts aren't modelled. That's also a reason why "LoRA Forgets Less". Use "modules_to_save" in HF PEFT or "lm_head", "embed_tokens" in @UnslothAI
3) Code rank=256 used α=32 (too small!) (pg 18), but Maths used α=2*r=512. The RS LoRA paper showed α/sqrt(r) scaling is needed for larger ranks, & common practice is α=2*r. So that's another reason why Code did worse than Maths
4) Extrapolating Maths vs fft looks good. Small datasets LoRA>fft, but I theorize that's because of reason 2
5) LoftQ & PiSSA paper init LoRA from SVD(W) => papers show comparable perf of LoRA
6) LoRA+ paper shows B matrix needs larger lr. DoRA (mentioned in paper) learns these scalars.
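A rough sketch of point 6 (a larger lr for the B matrices), assuming parameter names contain "lora_A" / "lora_B" as they do in HF PEFT. The helper name `lora_b_lr_groups` and the ratio of 16 are hypothetical and only for illustration; the thread does not give a ratio.

```python
def lora_b_lr_groups(named_params, base_lr, b_lr_ratio):
    """Split LoRA params into two optimizer groups; B gets base_lr * b_lr_ratio."""
    a_params, b_params = [], []
    for name, param in named_params:
        if "lora_B" in name:
            b_params.append(param)
        elif "lora_A" in name:
            a_params.append(param)
    return [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * b_lr_ratio},
    ]

# Stand-in (name, parameter) pairs; with PyTorch these would come from
# model.named_parameters(). The strings "A0"/"B0" are placeholders for tensors.
demo = [
    ("layers.0.q_proj.lora_A.weight", "A0"),
    ("layers.0.q_proj.lora_B.weight", "B0"),
]
groups = lora_b_lr_groups(demo, base_lr=1e-4, b_lr_ratio=16)  # ratio is illustrative
print(groups[0]["lr"], groups[1]["lr"])
```

The returned list has the per-group dict shape that PyTorch optimizers accept, so it could be handed to something like AdamW in place of a flat parameter list.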
TLDR: Code worse since α=32 is too small. No embed_tokens, lm_head (or layernorms), not even gate_proj? Better init & lr scaling can help
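To make the α point concrete, here is a small arithmetic sketch of the multiplier applied to the LoRA update, using the values quoted above (r=256, α=32 for Code, α=512 for Maths). The function names are made up for this example.

```python
import math

def lora_scale(alpha, rank):
    """Standard LoRA multiplier on the B @ A update: alpha / r."""
    return alpha / rank

def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

code_scale = lora_scale(32, 256)    # Code setup: 0.125
maths_scale = lora_scale(512, 256)  # Maths setup: 2.0
print(code_scale, maths_scale, maths_scale / code_scale)  # 0.125 2.0 16.0

# With alpha / sqrt(r) scaling, alpha=32 at r=256 gives the same multiplier
# as the Maths setup.
print(rslora_scale(32, 256))  # 2.0
```

So at the same rank, the Code run's update was scaled 16x smaller than the Maths run's.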
For continued pretraining, I advise people to train on all layers (inc gate_proj) plus lm_head & embed_tokens, use RS LoRA, and use rank>=256
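A minimal sketch of that advice as keyword arguments for HF PEFT's `LoraConfig`. Assumptions: Llama-style module names (q_proj, k_proj, etc.), and `lora_alpha=32` is purely illustrative since the advice above doesn't fix α. The helper name is hypothetical.

```python
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]  # QKVO (Llama-style names, assumed)
MLP = ["gate_proj", "up_proj", "down_proj"]           # includes gate_proj

def continued_pretraining_lora_kwargs(rank=256, alpha=32):
    """Build LoraConfig keyword arguments following the advice above."""
    return {
        "r": rank,
        "lora_alpha": alpha,  # illustrative value, not from the thread
        "use_rslora": True,   # scale by alpha / sqrt(r) instead of alpha / r
        "target_modules": ATTENTION + MLP,
        "modules_to_save": ["lm_head", "embed_tokens"],  # trained in full, not via LoRA
    }

kwargs = continued_pretraining_lora_kwargs()
print(kwargs)

# With HF PEFT installed, the config would presumably be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**kwargs)
```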
"LoRA Learns Less and Forgets Less" paper:
arxiv.org/abs/2405.09673
RS LoRA paper:
arxiv.org/pdf/2312.03732
LoRA+ paper:
arxiv.org/pdf/2402.12354
PiSSA paper:
arxiv.org/pdf/2404.02948
DoRA paper:
arxiv.org/pdf/2402.09353