Huh? You can actually do that?
Clustering NVIDIA DGX Spark + M3 Ultra Mac Studio for 4x faster LLM inference.
DGX Spark: 128GB @ 273GB/s, 100 TFLOPS (fp16), $3,999
M3 Ultra: 256GB @ 819GB/s, 26 TFLOPS (fp16), $5,599
The DGX Spark has one third the memory bandwidth of the M3 Ultra but roughly 4x the fp16 FLOPS.
By running compute-bound prefill on the DGX Spark, memory-bound decode on the M3 Ultra, and streaming the KV cache between them over 10GbE, we get the best of both machines, with massive speedups.
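To see why the split works, here is a rough back-of-envelope sketch in Python, not the actual implementation. It takes the peak bandwidth and fp16 numbers quoted above and assumes a hypothetical 8B-parameter fp16 model with an 8192-token prompt. The model shape (32 layers, 8 KV heads, head dimension 128) is also an assumption made only for illustration. Peak specs are upper bounds, attention FLOPs are ignored, and real throughput will be lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    mem_bandwidth_gbs: float  # GB/s, from the spec lines above
    fp16_tflops: float        # peak fp16 TFLOPS, from the spec lines above


DGX_SPARK = Device("DGX Spark", 273.0, 100.0)
M3_ULTRA = Device("M3 Ultra", 819.0, 26.0)

# Hypothetical workload (assumptions, not taken from the post).
PARAMS = 8e9            # model parameters
BYTES_PER_VALUE = 2     # fp16
PROMPT_TOKENS = 8192
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
LINK_GBS = 10 / 8       # 10GbE is 10 Gbit/s, i.e. 1.25 GB/s at best


def prefill_seconds(dev: Device) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2 * PARAMS * PROMPT_TOKENS / (dev.fp16_tflops * 1e12)


def decode_seconds_per_token(dev: Device) -> float:
    """Memory-bound estimate: every weight is read once per generated token."""
    return PARAMS * BYTES_PER_VALUE / (dev.mem_bandwidth_gbs * 1e9)


def kv_transfer_seconds() -> float:
    """Time to send the whole prompt's KV cache (keys and values) over the link."""
    kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * PROMPT_TOKENS
    return kv_bytes / (LINK_GBS * 1e9)


def best_for(cost) -> Device:
    """Pick the device with the lowest estimated cost for one phase."""
    return min((DGX_SPARK, M3_ULTRA), key=cost)


if __name__ == "__main__":
    print("prefill ->", best_for(prefill_seconds).name)
    print("decode  ->", best_for(decode_seconds_per_token).name)
    print(f"KV transfer: {kv_transfer_seconds():.2f} s")
```

Under these assumptions the estimate picks the DGX Spark for prefill and the M3 Ultra for decode, and sending the KV cache costs less time than running prefill on the M3 Ultra would.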
Short explanation in this thread & link to full blog post below.