Programming and compiling for AI hardware today takes several forms, which broadly fall into low-level, mid-level, and high-level approaches. (1) Low-level approaches use hand-written kernels (CUDA, CUTLASS, assembly/PTX, and similar), on which the default eager-mode execution engines of AI frameworks are built; (2) mid-level approaches are mostly OpenAI Triton, Pallas, and cuTile (announced); and (3) high-level approaches are compilers for PyTorch/JAX/TF functions, e.g. Inductor, XLA, and PolyBlocks. One would stop building low-level libraries/kernels if mid-level tools (e.g. Triton) delivered comparable performance; similarly, one would stop relying on hand-written mid-level code if high-level compilers delivered comparable performance. The results here compare the high-level, fully automatic approach of PyTorch-PolyBlocks against OpenAI Triton and against lower-level kernels (cuBLAS) on an NVIDIA GPU.
While Inductor is an automatic compiler, it still relies on a combination of low-level kernels (e.g. cuBLAS/CUTLASS) and mid-level frameworks (Triton); for matmul, it perhaps relies exclusively on cuBLAS/CUTLASS (so no code generation).
For Triton, the implementation used was the one from triton-lang.org/main/getting…
The results here show that high performance can be achieved with compact, productive programming if compilers are built right!
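For readers who want to put numbers on a comparison like this, matmul performance is commonly expressed as achieved throughput: an (M x K) by (K x N) product performs 2·M·N·K floating-point operations, divided by the measured time. The sketch below is only illustrative. The helper names `matmul_tflops` and `relative_performance` are hypothetical, and the timings are made-up placeholders, not measurements from this post.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for an (m x k) @ (k x n) matmul."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Each of the m*n outputs needs k multiplies and k adds.
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def relative_performance(candidate_seconds: float, baseline_seconds: float) -> float:
    """Speed of the candidate relative to the baseline (1.0 means parity)."""
    if candidate_seconds <= 0 or baseline_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical timings for illustration only (assumed, not measured).
size = 8192
baseline_s = 0.0050   # stands in for a hand-written library kernel
candidate_s = 0.0055  # stands in for compiler-generated code

print(f"baseline:  {matmul_tflops(size, size, size, baseline_s):.1f} TFLOP/s")
print(f"candidate: {matmul_tflops(size, size, size, candidate_s):.1f} TFLOP/s")
print(f"candidate / baseline: {relative_performance(candidate_s, baseline_s):.2f}x")
```

A ratio close to 1.0 is what "comparable performance" would mean in the argument above. In practice the timings should come from repeated, warmed-up GPU runs rather than a single measurement.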
Learn more:
docs.polymagelabs.com