Apple just released its programming guide for Metal Performance Primitives, and they suggest using Morton codes for tiled GEMM, but why?
In computer graphics, you use such space-filling curves all of the time
They keep objects that are close in space close in memory
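To make that concrete, here is a minimal sketch of a 2D Morton (Z-order) code in Python. It interleaves the bits of x and y, so cells that share their high coordinate bits share a code prefix and land near each other in the linear order. The names `morton_encode` and `morton_decode` and the 16-bit default are my own choices for illustration, not anything from Apple's guide.

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x on even positions, y on odd)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code, bits=16):
    """Inverse of morton_encode: split the even and odd bits back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


# Morton index of every cell in a 4x4 grid (rows are y, columns are x).
grid = [[morton_encode(x, y) for x in range(4)] for y in range(4)]
for row in grid:
    print(" ".join(f"{v:2d}" for v in row))
# Prints:
#  0  1  4  5
#  2  3  6  7
#  8  9 12 13
# 10 11 14 15
```

Each run of four consecutive indices covers a 2x2 block, and each run of sixteen covers a 4x4 block, which is the "close in space, close in memory" property.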
There are several reasons, but one of them is that you get better cache locality, meaning fewer expensive reads from device memory
This is exactly why it’s appealing for GEMM too - you have a lot of overlapping memory reads between the tiles
Morton order schedules tiles in compact square patches, which shrinks the working set that has to sit in last-level cache at any one time, so nearby threadgroups are more likely to reuse the data they share
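A rough way to see the effect is a toy model in Python. Assume the output matrix is split into a 16x16 grid of tiles, tile (row, col) reads row panel `row` of A and column panel `col` of B, and 16 consecutive tiles in the schedule are in flight together. The grid size, the wave size, and the helper `panels_per_wave` are all assumptions for illustration; this is not how Metal actually dispatches threadgroups, only a sketch of the working-set argument.

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x on even positions, y on odd)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def panels_per_wave(order, wave):
    """Average count of distinct A row panels plus B column panels per wave."""
    counts = []
    for start in range(0, len(order), wave):
        chunk = order[start:start + wave]
        rows = {r for r, _ in chunk}  # A panels this wave reads
        cols = {c for _, c in chunk}  # B panels this wave reads
        counts.append(len(rows) + len(cols))
    return sum(counts) / len(counts)


N = 16     # tiles per side of the output grid (assumed)
WAVE = 16  # tiles assumed to be in flight at the same time

tiles = [(r, c) for r in range(N) for c in range(N)]
row_major = tiles  # already ordered row by row
morton = sorted(tiles, key=lambda t: morton_encode(t[1], t[0]))

row_major_cost = panels_per_wave(row_major, WAVE)
morton_cost = panels_per_wave(morton, WAVE)
print(row_major_cost, morton_cost)  # prints: 17.0 8.0
```

In this model a row-major wave is one full row of tiles, so it touches 1 panel of A and 16 panels of B, while a Morton wave is a 4x4 patch that touches 4 of each. Same number of tiles, less than half the distinct data that has to stay in cache.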