Progress! Under $10/hr to pre-train a 20B MoE. Not too bad.
8 VMs, each with a single L40S GPU, spread across Poland, France, and the USA, plus a few extra 4090s and a 5090, training a 20B MoE at 6 s/step. (I also had it fully functional with only 4090s, just slower.)
Should have results soon, then we can try this out on a much, much larger model!
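
For a rough sense of scale, here's a back-of-the-envelope sketch using only the two figures above ($10/hr taken as an upper bound, 6 s/step). The function names and the 100,000-step run length are made up for illustration; nothing here says how many steps the actual run needs.

```python
# Back-of-the-envelope cost math from the two numbers in the post.
# Assumption: $10/hr is treated as an upper bound ("< $10/hr").
MAX_COST_PER_HOUR_USD = 10.0
SECONDS_PER_STEP = 6.0


def steps_per_hour(seconds_per_step: float) -> float:
    """Number of optimizer steps completed in one hour."""
    return 3600.0 / seconds_per_step


def cost_per_step(cost_per_hour: float, seconds_per_step: float) -> float:
    """Dollar cost of a single step at the given hourly rate."""
    return cost_per_hour / steps_per_hour(seconds_per_step)


def cost_for_steps(n_steps: int, cost_per_hour: float, seconds_per_step: float) -> float:
    """Dollar cost of n_steps at a constant step time and hourly rate."""
    return n_steps * cost_per_step(cost_per_hour, seconds_per_step)


if __name__ == "__main__":
    sph = steps_per_hour(SECONDS_PER_STEP)
    cps = cost_per_step(MAX_COST_PER_HOUR_USD, SECONDS_PER_STEP)
    print(f"steps per hour: {sph:.0f}")
    print(f"cost per step:  under ${cps:.4f}")

    # Hypothetical run length, NOT a figure from the post.
    hypothetical_steps = 100_000
    total = cost_for_steps(hypothetical_steps, MAX_COST_PER_HOUR_USD, SECONDS_PER_STEP)
    print(f"{hypothetical_steps:,} steps: under ${total:,.2f}")
```

So at 6 s/step that's 600 steps an hour, or under about 1.7 cents per step, assuming the step time and hourly price stay constant.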