Day 91/365 of GPU Programming
Looking more closely at AMD’s AITER and ATOM today to get a better sense of the full scope of the Kimi K2.5 1T FP4 / DeepSeek-R1-0528 FP4 MTP challenges, and of the differences between phases 1 and 2, such as TTFT and TPOT scoring (mainly out of curiosity).
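For my own reference, a minimal sketch of how TTFT (time to first token) and TPOT (time per output token) are usually computed from per-token timestamps. These are the common serving-benchmark definitions, not necessarily the challenge's exact scoring rules, and the function name and inputs are hypothetical:

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one request.

    request_start: time the request was sent (hypothetical input).
    token_times:   arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: delay until the first output token arrives
    ttft = token_times[0] - request_start
    # TPOT: average gap between tokens after the first one
    n_decode = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / n_decode if n_decode else 0.0
    return ttft, tpot


ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tpot)
```

Roughly, TTFT reflects queueing plus prefill, while TPOT reflects steady-state decode speed.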
AITER is AMD’s centralized operator library: basically a unified place for high-performance ops, with the kernels underneath coming from things like Triton, Composable Kernel (CK), and asm. It spans inference as well as training kernels, and even fused GEMM communication primitives.
ATOM sits a level above that. As far as I understand it so far, ATOM is a lightweight, vLLM-inspired LLM inference engine / model backend built around AITER kernels, with AMD-specific execution choices like AITER-native attention/MoE/sampling paths, continuous-batching-style scheduling, graph-captured decode, and support for TP/DP/EP. I find the separation interesting because it creates a natural stack boundary between operator/kernel optimization in AITER and model-level execution and serving integration in ATOM. AMD also seems to be using ATOM as an incubation layer for faster inference iteration, while still integrating with vLLM/SGLang and creating ways to upstream pieces into them rather than keeping everything permanently out of tree.
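To make "continuous batching style scheduling" concrete, here is a toy sketch in plain Python. It is not ATOM's or vLLM's actual scheduler, and every name in it is hypothetical. The idea is that finished sequences leave the batch after each decode step and waiting requests are admitted right away:

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps; requests maps id -> tokens to generate (>= 1).

    Returns a dict of id -> step index at which that request finished.
    """
    waiting = deque(requests.items())
    running = {}   # id -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # Admit new requests as soon as slots free up,
        # instead of waiting for the whole batch to drain
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        # One decode step produces one token for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                del running[rid]
                finished[rid] = step
    return finished


done = continuous_batching({"a": 1, "b": 3, "c": 2}, max_batch=2)
print(done)
```

In this toy run, request "c" takes over the slot freed by "a" after the first step rather than waiting for "b" to finish, which is the whole point of the scheme.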
Also have to say I really like using DeepWiki (thank you @silasalberti @swyx @ScottWu46) for use cases like this.
@karpathy was right. When documentation is sparse or out of date, being able to converse with a repo is a much cleaner way of extracting information than trying to piece things together from potentially stale snapshots of development.