I built a management tool for a 4× DGX Spark cluster: give it a Hugging Face model ID and it automatically downloads the model, distributes it across all 4 machines, and serves it through the head node.
The model is first downloaded to the head node, then synchronized to the other DGX Spark nodes over the 200G fabric, and launched as distributed inference powered by vLLM/Ray.
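The download, sync, and launch flow described above could be sketched roughly as follows. This is an illustrative outline, not the tool's actual code: the function name `build_deploy_plan`, the worker host names, the `/models` directory, and the choice of `rsync` for the sync step are all assumptions. The sketch only builds the command lines and does not run them.

```python
from typing import List


def build_deploy_plan(model_id: str, workers: List[str],
                      model_root: str = "/models") -> List[List[str]]:
    """Return the commands the head node would run, in order (hypothetical layout)."""
    # Assumed on-disk layout: one directory per model under model_root.
    local_dir = f"{model_root}/{model_id.replace('/', '--')}"

    # 1. Download the model to the head node.
    plan = [["huggingface-cli", "download", model_id, "--local-dir", local_dir]]

    # 2. Copy the files to every worker node (rsync is an assumption here).
    for host in workers:
        plan.append(["rsync", "-a", "--partial",
                     f"{local_dir}/", f"{host}:{local_dir}/"])

    # 3. Launch vLLM on the head node with Ray as the distributed backend.
    #    One GPU per node is assumed, so the parallel size is head + workers.
    world_size = len(workers) + 1
    plan.append(["vllm", "serve", local_dir,
                 "--tensor-parallel-size", str(world_size),
                 "--distributed-executor-backend", "ray"])
    return plan


if __name__ == "__main__":
    # Placeholder model ID and host names, for illustration only.
    for cmd in build_deploy_plan("example-org/example-model",
                                 ["spark-2", "spark-3", "spark-4"]):
        print(" ".join(cmd))
```

The launch step assumes a Ray cluster is already running across the nodes. Whether tensor or pipeline parallelism works better across the fabric depends on the model and would need testing.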
Thanks to NVFP4 support, I was able to run massive MoE models such as Qwen3.5-397B-A17B-NVFP4 across 4 nodes. The tool also displays OpenWebUI connectivity, cluster health checks, node-level unified RAM usage, and aggregate tok/sec benchmark metrics on a single dashboard.
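For a rough sense of why NVFP4 makes this fit, here is a back-of-the-envelope estimate of the weight footprint. It assumes every weight is stored in NVFP4 (4-bit values plus one 8-bit scale per 16-element block) and that each DGX Spark has 128 GB of unified memory. Real checkpoints usually keep some layers in higher precision, and the KV cache and activations need room too, so treat the result as a lower bound on memory use.

```python
def nvfp4_weight_gb(params_billions: float,
                    block_size: int = 16,
                    scale_bits: int = 8) -> float:
    """Approximate weight size in GB if all parameters were NVFP4."""
    # 4 bits per value plus the per-block scale amortized over the block.
    bits_per_weight = 4 + scale_bits / block_size
    # billions of params * bits / 8 gives billions of bytes, i.e. GB.
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    nodes = 4
    node_memory_gb = 128  # assumed unified memory per DGX Spark
    total = nvfp4_weight_gb(397)  # 397B total parameters, from the model name
    print(f"weights: ~{total:.0f} GB total, ~{total / nodes:.0f} GB per node")
    print(f"cluster memory: {nodes * node_memory_gb} GB")
```

Under these assumptions the weights exceed a single node's memory but fit once they are split across four.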
This means model selection, deployment, restart, stop, and performance testing no longer require SSH’ing into each machine one by one. 🎉
I’ll be releasing the tool this week. ❤️ Huge thanks to @NVIDIAAI for building these incredible devices, and to @ASUSTR for their support. 🚀