I was lucky enough to work in LLM labs in both China and the US, and I've been thinking about this for a while. The two sides really do value different things in pretraining:
US labs be like:
- lots of GPUs and much larger-FLOPs runs
- treat stability more seriously: they can't tolerate loss spikes in a large-FLOPs run, so they invented lots of stability tricks, including all kinds of soft-capping, muP, and spectral-norm control
- treat predictability more seriously. Check the GPT-4 report for reference: they even try to predict eval task performance
- because of the stability and predictability requirements, treat hyperparameters and optimization more seriously
- generally believe more in data and optimization than in arch
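For anyone who hasn't seen soft-capping: a minimal sketch of the tanh form, `cap * tanh(x / cap)`, which is one common way to do it. The function names and the cap value of 30.0 are my own placeholders for illustration, not any specific lab's config.

```python
import math

def soft_cap(x: float, cap: float) -> float:
    """Smoothly squash x into [-cap, cap]; close to identity when |x| << cap."""
    return cap * math.tanh(x / cap)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CAP = 30.0  # hypothetical cap value, chosen only for this example

# One runaway logit, e.g. from an activation blow-up in a bad step.
raw_logits = [2.0, -1.0, 500.0]
capped_logits = [soft_cap(v, CAP) for v in raw_logits]

# The outlier can no longer exceed the cap, while small logits barely move.
probs = softmax(capped_logits)
```

The point is that the cap bounds the logits without a hard clip, so the gradient stays smooth instead of dropping to zero at a threshold.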
China labs be like:
- have very limited GPUs, e.g. K2 on 4k GPUs and V3 on 2k GPUs
- as a result, push the limits of pretraining model-infra co-design: see the many tricks in V3, and K2 has some cool stuff too (the offload trick helps remove the stupid MoE gating constraint, and it only uses EP 16)
- care more about model arch and token efficiency than about optimization and stability
- care more about data quality than data quantity
- take inference into consideration from day 0, even before training starts
In general, China labs are trying to use <4e24 FLOPs models to catch up with >1e25 FLOPs models. It is hard or impossible, but they are making good progress.
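A rough sketch of where budgets like these come from, assuming the usual rule of thumb that training compute is about 6 x parameters x tokens (for MoE, count activated parameters; this ignores attention FLOPs). The 37B activated parameters and 14.8T tokens below are the V3 figures as I recall them from the public report, so treat them as assumed inputs rather than exact values.

```python
def train_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens

def tokens_for_budget(active_params: float, flops_budget: float) -> float:
    """Invert the same rule: tokens you could afford at a given compute budget."""
    return flops_budget / (6.0 * active_params)

# Assumed inputs (recalled from public reports, not verified here).
ACTIVE_PARAMS = 37e9   # activated parameters per token
TOKENS = 14.8e12       # pretraining tokens

budget = train_flops(ACTIVE_PARAMS, TOKENS)

# How many tokens the same model would see at a 1e25 FLOPs budget.
tokens_at_1e25 = tokens_for_budget(ACTIVE_PARAMS, 1e25)

print(f"estimated budget: {budget:.2e} FLOPs")
print(f"tokens at 1e25 FLOPs: {tokens_at_1e25:.2e}")
```

With those inputs the estimate comes out a bit above 3e24, so under the 4e24 line, and reaching 1e25 with the same model would take roughly three times as many tokens.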
I am actually very happy to see Qwen's new attempt at model arch; they used to focus more on the data side than on the arch side. They didn't develop linear attn just so people would think they are innovating; it is actually about pushing the limit for test-time scaling. Llama 4 failed for many reasons, but Qwen-Next is different. They used very limited FLOPs, and it is a brave attempt for good reasons.
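Why linear attn matters for test-time scaling, as a minimal sketch: plain, unnormalized causal linear attention can be run as a recurrence with a fixed-size state, so per-token decode cost does not grow with context length the way a KV cache does. This is the textbook form, not Qwen's actual layer, and the function names are my own.

```python
def linear_attention(qs, ks, vs):
    """Causal, unnormalized linear attention in recurrent form.

    qs, ks: lists of length-d_k vectors; vs: list of length-d_v vectors.
    The state is a d_k x d_v matrix, independent of sequence length.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of k_t v_t^T
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # o_t = q_t^T S_t
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs

def quadratic_reference(qs, ks, vs):
    """Same result computed the attention-matrix way (no softmax)."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):  # causal: only positions <= t
            score = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(len(out)):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs

# Tiny made-up example to compare the two forms.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vs = [[1.0], [2.0], [3.0]]
recurrent_out = linear_attention(qs, ks, vs)
reference_out = quadratic_reference(qs, ks, vs)
```

Both functions give the same outputs, but the recurrent one only ever stores the d_k x d_v state, which is why long generations stay cheap.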
I bet OpenAI/xAI are laughing so hard. This result is obvious tbh: they took a permanent architectural debuff in order to save on compute costs.