Yilong Chen

Tweets

Pinned Tweet

Yilong Chen

@Yichen4NLP

Mar 6

We introduce MoUE, a new MoE paradigm that boosts base-model performance by up to 1.3 points on average when training from scratch and by up to 4.2 points on average in checkpoint conversion / CPT, without increasing either activated parameters or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. arxiv.org/abs/2603.04971 huggingface.co/papers/2603.0… #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining

Mixture of Universal Experts: Scaling Virtual Width via...

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet their scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose...

arxiv.org

Yilong Chen retweeted

Andrej Karpathy

@karpathy

Mar 9

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I have been doing daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanochat…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
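The QKnorm point in the tweet above can be illustrated with a toy example. This is a minimal sketch and not nanochat's code: the function names, the `scale` parameter, and the example vectors are all assumptions made for illustration. Once queries and keys are L2-normalized, every attention logit is a cosine similarity in [-1, 1], so the softmax cannot become very peaked unless the logits are multiplied by a scalar.

```python
import math

def l2_normalize(v):
    # Parameterless normalization: rescale the vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, scale=1.0):
    # Hypothetical helper: attention of one query over a few keys,
    # with QK normalization followed by a scalar multiplier `scale`.
    qn = l2_normalize(q)
    logits = [
        scale * sum(a * b for a, b in zip(qn, l2_normalize(k)))
        for k in keys
    ]
    return softmax(logits)

# Illustrative vectors: the first key matches the query exactly.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

diffuse = attention_weights(q, keys, scale=1.0)   # logits limited to [-1, 1]
sharp = attention_weights(q, keys, scale=10.0)    # logits limited to [-10, 10]

print("scale=1 :", [round(w, 3) for w in diffuse])
print("scale=10:", [round(w, 3) for w in sharp])
```

Even with a perfectly matching key, the unscaled version spreads a large share of the attention mass over the other keys, while the scaled version concentrates almost all of it on the match.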

Yilong Chen

@Yichen4NLP

Mar 6

Tagging a few communities and curators who might find this interesting: @_akhaliq @huggingface @dair_ai @the_gradient @rasbt Would greatly appreciate any feedback or discussions!

Yilong Chen

@Yichen4NLP

Mar 6

The UELB point is central. Under reuse, load balancing should not just be layer-local. It should reflect the computation graph. That gives a new depth-wise / topology-aware view of load balancing: balance experts relative to where they can be used, not how often they appear globally. This is a different optimization problem from standard MoE.
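To make the distinction in the tweet above concrete, here is a minimal sketch that compares a global balance measure with one computed relative to where each expert can be used. It is not the paper's UELB objective: the function names, the `reachable` mask, the max/mean ratio, and the token counts are all hypothetical choices for illustration.

```python
def global_imbalance(counts):
    # Max/mean ratio of total tokens per expert, summed over all layers.
    # Ignores which layers are able to route to which experts.
    n_experts = len(counts[0])
    totals = [sum(layer[e] for layer in counts) for e in range(n_experts)]
    mean = sum(totals) / n_experts
    return max(totals) / mean

def topology_aware_imbalance(counts, reachable):
    # Average over layers of the max/mean ratio, computed only among the
    # experts that each layer can actually reach.
    # Assumes every layer routes at least one token to a reachable expert.
    ratios = []
    for layer_counts, layer_mask in zip(counts, reachable):
        usable = [c for c, ok in zip(layer_counts, layer_mask) if ok]
        mean = sum(usable) / len(usable)
        ratios.append(max(usable) / mean)
    return sum(ratios) / len(ratios)

# Hypothetical setup: 2 layers share a pool of 3 experts.
# Expert 0 is reachable from both layers; experts 1 and 2 from one layer each.
reachable = [
    [True, True, False],
    [True, False, True],
]
# Tokens routed to each expert at each layer (made-up numbers).
counts = [
    [50, 50, 0],
    [50, 0, 50],
]

print("global view:        ", global_imbalance(counts))
print("topology-aware view:", topology_aware_imbalance(counts, reachable))
```

In this toy setup the shared expert looks overloaded under the global count simply because it is reachable from more layers, while each layer splits its tokens evenly among the experts it can use.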

Yilong Chen

@Yichen4NLP

Mar 6

The result is a useful scaling trade: instead of buying capacity mainly with more activated compute or more stored parameters, we can trade **algorithmic structure** for capacity by increasing global reusable experts and their recursive compositions.

In practice:
- up to 1.3 avg from scratch with no increase in activated params or total params
- ~2.5 in depth expansion
- up to 4.2% avg in checkpoint conversion / CPT

Our bet is that MoE may scale not only by adding more experts, but by making experts more reusable, modular, and globally composable. That is the direction behind MoUE.

Yilong Chen

@Yichen4NLP

4 Jan 2025

🚀 Excited to share our latest work on next-gen pretrained model architecture: Mixture of Hidden-Dimensions Transformer (MoHD)! 🔗 arxiv.org/pdf/2412.05644
🌟 Results:
• Compress activated parameters by 50% while improving performance by 1.7%.
#AI #DeepLearning #Transformer #NLP

Yilong Chen

@Yichen4NLP

4 Jan 2025

✨ Tackling the challenge of scaling hidden dimensions in large models, we introduce sparse activation in hidden dimensions and a novel information flow maintenance mechanism. MoHD achieves unmatched parameter efficiency and scalability.

Yilong Chen

@Yichen4NLP

4 Jan 2025

Results:
• Compress activated parameters by 50% while improving performance by 1.7%.
• Scale up parameters 3x (without increasing activated parameters) and boost performance by 3.7%.
💡 MoHD offers a new perspective on scaling laws, paving the way for pushing their limits!
