For the benefit of others, I want to document a bug affecting devices built on the NVIDIA GB10 chipset, such as the DGX Spark and the variants made by companies like MSI.
This Nvidia bug affects all GB10-based systems (NVIDIA DGX Spark, ASUS Ascent GX10, and by extension MSI's EdgeXpert/GB10 variant) because they share the same SoC and ConnectX-7 wiring.
The symptom: two DGX Sparks are connected via QSFP and ethtool shows both interfaces negotiating 200 Gbps, but actual throughput is capped at ~13 Gbps under both TCP (iperf3) and RDMA (ib_write_bw).
So instead of 200 Gbps or 120 Gbps between the two boxes, you get just 12.9 Gbps, which is far too slow when you are trying to distribute an LLM across them.
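If you want to check whether your own pair of boxes shows the same gap, here is a rough Python sketch that compares the negotiated speed with the measured one. It assumes you have saved the text output of `ethtool <iface>` and the JSON output of `iperf3 -c <host> -J`; the function names are my own and not part of any tool.

```python
import json
import re


def negotiated_gbps(ethtool_output: str) -> float:
    """Return the link speed from the 'Speed: <N>Mb/s' line of ethtool output, in Gbps."""
    match = re.search(r"Speed:\s*(\d+)\s*Mb/s", ethtool_output)
    if match is None:
        raise ValueError("no 'Speed: <N>Mb/s' line found in ethtool output")
    return int(match.group(1)) / 1000.0


def measured_gbps(iperf3_json: str) -> float:
    """Return receiver-side TCP throughput from iperf3 JSON output, in Gbps."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def link_efficiency(ethtool_output: str, iperf3_json: str) -> float:
    """Fraction of the negotiated rate that the TCP test actually achieved."""
    return measured_gbps(iperf3_json) / negotiated_gbps(ethtool_output)
```

With the numbers from this post (200 Gbps negotiated, 12.9 Gbps measured) the ratio comes out to roughly 0.065, so the link is delivering about 6.5% of what it negotiated.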
The root cause: the ConnectX-7 firmware reports "insufficient power on the PCIe slot (27W)" and throttles both PCIe domains. RDMA hits the same wall as TCP, which rules out the kernel networking stack and points to firmware/hardware below the software layer.
Updating the driver from 580.126 to 580.142 with sudo apt full-upgrade resolves it completely. The power warning persists in the logs, but the link is no longer throttled.
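To check whether a box already has the fixed driver, something like the following Python sketch may help. It assumes `nvidia-smi` is on the PATH and that the version it prints is the same driver version referred to in this post; the helper names and the `FIXED_VERSION` constant are my own.

```python
import subprocess

# Version that resolved the throttling, per this post (assumption: major.minor is enough).
FIXED_VERSION = (580, 142)


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '580.142' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(installed: str, fixed: tuple = FIXED_VERSION) -> bool:
    """True if the installed version is at least the fixed version."""
    return parse_version(installed)[: len(fixed)] >= fixed


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version (first GPU listed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

Calling `has_fix(installed_driver_version())` on each box tells you whether it still needs the upgrade.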
Problem solved. Hope this saves you some time. NVIDIA should have told customers about this, and they should have shipped the units with the updates in place, but they didn't.