Breaking the Synchronization Bottleneck in Distributed Training with AsyncMesh.
Communication overhead in synchronous data and pipeline parallelism restricts distributed training of large language models to co-located clusters with high-bandwidth interconnects. Our recent work from @Pluralis introduces AsyncMesh, which enables fully asynchronous optimization across both parallelism axes. By eliminating blocking communication, AsyncMesh avoids idle time, improves throughput, and makes efficient use of heterogeneous hardware.
Asynchrony, however, introduces optimization challenges due to staleness between pipeline-parallel (PP) stages and data-parallel (DP) replicas. For PP, we use our prior Nesterov-style weight look-ahead method to compensate for stage-dependent gradient delay. For DP, we introduce asynchronous sparse averaging, which communicates only a small subset of parameters and corrects for delay via an EMA-based staleness estimator. We observe that sparse averaging is inherently robust to weight inconsistencies (e.g., staleness and quantization noise), making it well-suited for asynchronous settings while also substantially reducing data transfer between replicas.
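To make the DP idea concrete, here is a minimal sketch of one sparse averaging step with an EMA-based delay correction. It is an illustration under stated assumptions, not the algorithm from the paper: the function names, the random choice of indices, and the linear extrapolation `delay * drift_ema` are all hypothetical choices made for this example.

```python
import random

def sparse_indices(num_params, fraction, seed):
    """Pick the subset of parameter indices to average this round.

    Assumption: a seeded random subset, so all replicas can agree on it.
    """
    k = max(1, int(num_params * fraction))
    return sorted(random.Random(seed).sample(range(num_params), k))

def update_drift_ema(drift_ema, prev_weights, weights, beta=0.9):
    """Track an EMA of the per-step weight change for each parameter."""
    return [beta * d + (1.0 - beta) * (w - p)
            for d, p, w in zip(drift_ema, prev_weights, weights)]

def apply_sparse_average(weights, stale_avg, indices, drift_ema, delay):
    """Overwrite the selected coordinates with a delay-corrected average.

    stale_avg[j] is the cross-replica mean of coordinate indices[j],
    computed `delay` steps ago. Assumption: the stale value is moved
    forward by `delay` steps of the estimated per-step drift.
    """
    out = list(weights)
    for j, i in enumerate(indices):
        out[i] = stale_avg[j] + delay * drift_ema[i]
    return out
```

Only the coordinates in `indices` are exchanged and updated; every other weight stays local, which is where the reduction in data transfer comes from in this sketch.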
Empirically, we observe no performance degradation compared to fully synchronous training across a range of LLM training configurations, while significantly reducing communication overhead. More broadly, AsyncMesh makes distributed training feasible beyond co-located, high-speed clusters, facilitating large-scale collaborative training over the internet.
The attached video illustrates the key concepts of the method, and the paper can be found here:
arxiv.org/abs/2601.22442