🚨 #BREAKING: CoreWeave $CRWV Trained One of the Most Demanding AI Models Ever Benchmarked in Just Over Two Minutes on the Largest GB300 GPU Cluster Submitted to MLPerf.
What happened:
➜ CoreWeave $CRWV announced on June 16, 2026, that it achieved the fastest DeepSeek-V3 671B training performance among all cloud submissions in the MLPerf Training v6.0 benchmark.
➜ MLPerf is the industry-standard AI benchmarking suite developed by the MLCommons consortium.
➜ CoreWeave trained the 671-billion-parameter DeepSeek-V3 model to target quality in 2.02 minutes.
➜ The run used 8,192 Nvidia $NVDA Blackwell Ultra GPUs in GB300 NVL72 systems.
➜ Each GB300 NVL72 system connects 72 Blackwell Ultra GPUs and 36 Grace CPUs through NVLink.
➜ The NVL72 systems were interconnected using Nvidia Spectrum-X Ethernet.
➜ CoreWeave was the only submitter in this MLPerf round to scale a GB300 DeepSeek-V3 deployment beyond 2,048 GPUs.
➜ It was the largest GB300 cluster submitted in the benchmark.
Why DeepSeek-V3 is the hardest test in this round:
➜ MLCommons introduced DeepSeek-V3 as a new pretraining benchmark in MLPerf Training v6.0.
➜ The benchmark reflects the latest class of frontier AI models.
➜ DeepSeek-V3 is a 671-billion-parameter mixture-of-experts model and serves as the base model for DeepSeek-R1.
➜ Mixture-of-experts models activate only a portion of their total parameters for each token they process.
➜ While this improves efficiency, it also creates significant communication overhead between GPUs as workloads are dynamically routed between experts.
➜ That communication overhead is one of the primary scaling challenges the benchmark measures.
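To make the routing overhead concrete, here is a minimal sketch of top-k expert routing in plain Python. Everything in it is an illustrative assumption rather than DeepSeek-V3's actual configuration or the benchmark's code: the expert count, the top-k value, the GPU count, and the random router scores are all made up, and the helper names `route_tokens` and `cross_gpu_fraction` are hypothetical.

```python
import random

def route_tokens(router_logits, top_k):
    """Return, for each token, the indices of its top_k highest-scoring experts."""
    routes = []
    for logits in router_logits:
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        routes.append(ranked[:top_k])
    return routes

def cross_gpu_fraction(routes, token_gpu, experts_per_gpu):
    """Fraction of token-to-expert dispatches that leave the token's own GPU."""
    total = 0
    remote = 0
    for token, experts in enumerate(routes):
        for expert in experts:
            total += 1
            # Experts are laid out contiguously: expert e lives on GPU e // experts_per_gpu.
            if expert // experts_per_gpu != token_gpu[token]:
                remote += 1
    return remote / total

random.seed(0)
# Illustrative sizes only, not DeepSeek-V3's real configuration.
NUM_EXPERTS, TOP_K, NUM_GPUS, NUM_TOKENS = 16, 2, 4, 1000

# Random scores stand in for a learned router.
logits = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]
token_gpu = [t % NUM_GPUS for t in range(NUM_TOKENS)]

routes = route_tokens(logits, TOP_K)
frac = cross_gpu_fraction(routes, token_gpu, NUM_EXPERTS // NUM_GPUS)

print(f"active experts per token: {TOP_K}/{NUM_EXPERTS}")
print(f"dispatches leaving the local GPU: {frac:.0%}")
```

With experts spread evenly across GPUs and no locality in the routing, most dispatches land on a remote GPU, which is the kind of traffic the interconnect has to absorb.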
The scaling results:
➜ On 8,192 GPUs, CoreWeave completed training in 2.02 minutes.
➜ On 4,096 GPUs, training completed in 3.09 minutes.
➜ On 2,048 GPUs, training completed in 5.54 minutes.
➜ Each time cluster size doubled, training time fell by roughly 1.8x (2,048 to 4,096 GPUs) and 1.5x (4,096 to 8,192 GPUs), against an ideal of 2x.
➜ This indicates the infrastructure continued scaling efficiently instead of encountering the diminishing returns commonly seen in very large GPU deployments.
➜ On a separate 4,096-GPU GB300 deployment, CoreWeave trained Llama 3.1 405B in 9.77 minutes.
➜ CoreWeave said that result achieved near-parity with larger GB200-based deployments while using 20% fewer GPUs.
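The DeepSeek-V3 scaling figures above can be checked with a few lines of arithmetic. This sketch takes the three reported times, treats the 2,048-GPU run as the baseline, and computes speedup and scaling efficiency (speedup divided by the increase in GPU count). The choice of baseline and this definition of efficiency are assumptions made for illustration, not MLPerf's own metric.

```python
# Times to target quality in minutes, as reported in the post.
results = {2048: 5.54, 4096: 3.09, 8192: 2.02}

def scaling_table(results):
    """Return (gpus, minutes, speedup, efficiency) rows relative to the smallest run."""
    base_gpus = min(results)
    base_time = results[base_gpus]
    rows = []
    for gpus in sorted(results):
        speedup = base_time / results[gpus]   # how much faster than the baseline run
        ideal = gpus / base_gpus              # speedup under perfectly linear scaling
        rows.append((gpus, results[gpus], speedup, speedup / ideal))
    return rows

for gpus, minutes, speedup, eff in scaling_table(results):
    print(f"{gpus:>5} GPUs  {minutes:.2f} min  speedup {speedup:.2f}x  efficiency {eff:.0%}")
```

By this simple measure, the 4,096-GPU run is about 1.79x faster than the baseline (about 90% of ideal), and the 8,192-GPU run is about 2.74x faster (about 69% of ideal).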
What belongs to Nvidia and what belongs to CoreWeave:
➜ Nvidia $NVDA says its GB300 NVL72 platform delivers a 1.6x generational training performance improvement on DeepSeek-V3 compared with GB200.
➜ According to Nvidia, the improvement comes from larger memory capacity and higher power budgets built into the GB300 platform.
➜ Nvidia also said software optimizations alone improved DeepSeek-V3 training throughput by 1.3x over a three-month period on the same GB300 hardware, without changing the chips.
➜ Those hardware and software improvements are Nvidia platform improvements and are available to cloud providers using GB300.
➜ CoreWeave’s differentiation comes from how it operates that hardware at scale.
➜ The company highlighted its Mission Control orchestration layer, which continuously monitors GPU health, firmware, and thermal conditions before and during training.
➜ CoreWeave also uses topology-aware scheduling to keep expert-parallel workloads within the same NVLink domain, reducing cross-rack communication.
➜ Its rail-aware networking is designed to minimize bandwidth hotspots across multi-thousand-GPU clusters.
➜ CoreWeave’s position is that these software and infrastructure layers turn Nvidia’s raw hardware performance into production-scale AI training that customers can deploy today, rather than a one-off benchmark result.
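The idea behind topology-aware placement can be sketched in a few lines. This is not CoreWeave's scheduler, whose internals the post does not describe. It is a hypothetical illustration that packs expert-parallel groups so that no group straddles two NVLink domains. The function name `place_expert_groups`, the group size of 8, and the three-rack cluster are all assumptions. Only the 72 GPUs per NVL72 system comes from the post.

```python
from collections import defaultdict

def place_expert_groups(gpu_domains, group_size):
    """Pack expert-parallel groups so that no group spans two NVLink domains.

    gpu_domains maps a GPU id to its NVLink domain id (for example, one rack).
    Returns (groups, leftover): groups is a list of GPU-id lists, and leftover
    holds GPUs that could not fill a whole group inside their own domain.
    """
    by_domain = defaultdict(list)
    for gpu, domain in sorted(gpu_domains.items()):
        by_domain[domain].append(gpu)

    groups, leftover = [], []
    for domain in sorted(by_domain):
        gpus = by_domain[domain]
        full = len(gpus) - len(gpus) % group_size  # GPUs that fit into whole groups
        for i in range(0, full, group_size):
            groups.append(gpus[i:i + group_size])
        leftover.extend(gpus[full:])
    return groups, leftover

GPUS_PER_DOMAIN = 72  # from the post: 72 GPUs per GB300 NVL72 system
EP_GROUP_SIZE = 8     # hypothetical expert-parallel degree

# A made-up three-rack cluster with GPUs numbered consecutively per rack.
gpu_domains = {gpu: gpu // GPUS_PER_DOMAIN for gpu in range(3 * GPUS_PER_DOMAIN)}
groups, leftover = place_expert_groups(gpu_domains, EP_GROUP_SIZE)
print(len(groups), "groups,", len(leftover), "GPUs left over")
```

A group size that does not divide 72 evenly leaves some GPUs in each domain unassigned, which is the trade-off a real scheduler would have to weigh against cross-rack traffic.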
We just trained DeepSeek-V3 671B in 2 minutes.
That's all 671 billion parameters, on 8,192 NVIDIA Blackwell Ultra GPUs connected with NVIDIA Spectrum-X Ethernet.
The fastest DeepSeek-V3 training run anyone's recorded, set in the new MLPerf® Training v6.0 round on the same cloud platform our customers run on every day.