Last week we explored NVIDIA's CuTe layouts. Today, we put that theory into practice. Part 2 is out now! Most CuTe examples skip straight to highly optimized code without explaining the reasoning. Join me as we build a MM kernel with CuTe, that can beat cuBLAS in certain cases 🧵