I'm thinking of doing either a webinar or a blog post called "All about MoEs". Would this be interesting? Which format? Which other topics should I include?
Topics that could be covered:
- What are sparse models and MoEs?
- Sparsely activated MoEs and Top-k gating
- MoEs and transformers
- Load balancing
- Overview of open-source MoEs (Mixtral, Switch Transformer, OpenMoE)
- Quantization and MoEs
- What is an "expert"?
- Expert parallelism and why MoEs are interesting for pre-training
- MoEs for local usage vs. high-usage deployment
- Challenges of fine-tuning MoEs
- DeepSeekMoE
- How to compare MoEs to dense models?
- Training MoEs from dense checkpoints
- Model merging and MoE merges
Papers to cover: Outrageously Large Neural Networks (2017), ZeRO (2019), GShard (2020), GLaM (2021), DSelect-k (2021), Hash Layers (2021), BASE Layers (2021), Switch Transformers (2022), ST-MoE (2022), FasterMoE (2022), MegaBlocks (2022), A Review of Sparse Expert Models (2022), Unified Scaling Laws for Routed Language Models (2022), Sparse Upcycling (2022), Mixture-of-Experts Meets Instruction Tuning (2023), QMoE (2023), Mixtral (2023), DeepSeekMoE (2024)