SPEED IS THE MOAT: AMD's ROCm software stack has improved performance by over 75x in the 14 days since the DeepSeekv4 launch. The gains come from fusing mHC operations & fusing RoPE Hadamard transformations, which reduces CPU overhead & improves HBM utilization. Furthermore, other kernels like the attention indexer & KV cache compressor have been written in TileLang & Triton for fast development velocity.
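To make the fusion point concrete, here is a toy sketch in pure Python. It is not the ROCm kernel code. It assumes a simplified RoPE that rotates every consecutive pair by one shared angle `theta`, and a normalized 2x2 Hadamard step on each pair. All function names are hypothetical. The unfused path writes an intermediate buffer and reads it back, while the fused path keeps the rotated values in locals and makes a single pass.

```python
import math


def rope_pairs(x, theta):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by theta (simplified RoPE)."""
    c, s = math.cos(theta), math.sin(theta)
    out = [0.0] * len(x)
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out


def hadamard_pairs(x):
    """Apply the normalized 2x2 Hadamard matrix to each consecutive pair."""
    r = 1.0 / math.sqrt(2.0)
    out = [0.0] * len(x)
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out[i] = (a + b) * r
        out[i + 1] = (a - b) * r
    return out


def unfused(x, theta):
    """Two passes: the intermediate buffer stands in for an extra HBM round trip."""
    tmp = rope_pairs(x, theta)   # pass 1 writes a full intermediate buffer
    return hadamard_pairs(tmp)   # pass 2 reads it back


def fused(x, theta):
    """One pass: rotated values stay in locals (the analogue of registers)."""
    c, s = math.cos(theta), math.sin(theta)
    r = 1.0 / math.sqrt(2.0)
    out = [0.0] * len(x)
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        ra, rb = a * c - b * s, a * s + b * c  # rotation, never stored to a buffer
        out[i] = (ra + rb) * r                 # Hadamard step applied immediately
        out[i + 1] = (ra - rb) * r
    return out
```

Both paths return the same values for an even-length input, e.g. `fused([1.0, 2.0, 3.0, 4.0], 0.3)` matches `unfused([1.0, 2.0, 3.0, 4.0], 0.3)`. On a GPU the fused form would also mean one kernel launch instead of two, which is where the reduced CPU overhead comes from.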
Another 5x performance improvement is needed to catch up to single-node aggregated B200 performance, & then another 1.5x to catch up to PD disaggregated B200 performance, which is within the realm of possibility for AMD within the next couple of weeks. Great work by HaiShaw, Thomas, @roaner, & @AnushElangovan on this rapid improvement.
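The multipliers above compound, so the remaining gap can be worked out directly. A small sketch, assuming "over 75x" is treated as exactly 75x and that the 5x and 1.5x steps multiply (constant and function names are illustrative only):

```python
# Multipliers quoted in the post; "over 75x" is rounded down to 75x here.
GAINED_SINCE_LAUNCH = 75.0
TO_AGGREGATED_B200 = 5.0
TO_DISAGGREGATED_B200 = 1.5


def remaining_gap():
    """Total speedup still needed to reach PD disaggregated B200 performance."""
    return TO_AGGREGATED_B200 * TO_DISAGGREGATED_B200


def cumulative_vs_launch():
    """Speedups relative to the launch-day baseline at each B200 target."""
    aggregated = GAINED_SINCE_LAUNCH * TO_AGGREGATED_B200
    disaggregated = aggregated * TO_DISAGGREGATED_B200
    return aggregated, disaggregated


print(remaining_gap())         # 7.5
print(cumulative_vs_launch())  # (375.0, 562.5)
```

Under those assumptions, about 7.5x remains, and matching PD disaggregated B200 would correspond to roughly 562x over the launch-day baseline.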