We wrote our own inference engine in Rust 🦀 for our AMD MI355Xs, with no ROCm anywhere underneath it.
On the small all-reduce that tensor-parallel inference runs for every single token, it comes out about 1.3x faster than AMD's own RCCL at its best, sitting on what looks like the physical latency floor of the chip at around twenty microseconds.
As the first anywhere to have the MI355X fully deployed in production, running Matilda, inference is a huge focus for
@MaincodeAU. In fact inference is the part of an AI business that actually meets customers, so we wanted to understand this hardware as deeply as we could. We took the runtime out and drove the GPU through the Linux kernel driver directly, the same interface ROCm itself is built on, to see how close to the silicon we could get.
RCCL is genuinely great and has years of careful work behind it, and it also has to be good at everything a runtime is ever asked to do. A layer that only has to do one thing really well can set all of that aside and cut its overhead to almost nothing. The win isn't a clever trick, just the absence of overhead.
And you don't have to take our word for it. The whole thing ships as a single container image with the kernels compiled in, and three commands reproduce the numbers on any MI355X box.
More in the blog post...