47/100 of GPU Grind
following stanford cs149, lecture 3: cpu multithreading to hide stalls and maximise core utilization, with the Intel Kaby Lake cpu as the example of a superscalar core that can issue multiple instructions per clock cycle. also covers heterogeneous superscalar cores, where scalar and SIMD ALUs can both execute in the same clock cycle
also covering GPU SIMT with a simplified explanation of warps and warp divergence
i love re-watching a course about "beginner" HPC concepts and seeing how they're explained by different professors, the laundry pipelining analogy was funny 😆
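to make the warp divergence part concrete, here's a minimal sketch (not from the lecture). the kernel names `branch_per_lane` and `branch_per_warp` and the host helper `run_branch_demo` are made up for illustration. the first kernel branches on lane parity, so threads of the same warp disagree on the branch. the second branches on the warp index, so all 32 threads of a warp agree. caveat: for a branch this small the compiler may well emit predicated instructions instead of a real divergent branch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Branch depends on lane parity: inside one warp, even and odd lanes want
// different paths, so the warp runs both paths with part of its lanes masked.
__global__ void branch_per_lane(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2.0f * i;    // even lanes
    } else {
        out[i] = -1.0f * i;   // odd lanes
    }
}

// Branch depends on the warp index: all 32 threads of a warp take the same
// path, so there is no divergence inside a warp.
__global__ void branch_per_warp(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) {
        out[i] = 2.0f * i;    // even warps
    } else {
        out[i] = -1.0f * i;   // odd warps
    }
}

// Hypothetical host helper: launches one of the two kernels on n elements
// and returns the result as a host vector (error checking omitted).
std::vector<float> run_branch_demo(bool per_lane, int n) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (per_lane) {
        branch_per_lane<<<blocks, threads>>>(d_out, n);
    } else {
        branch_per_warp<<<blocks, threads>>>(d_out, n);
    }
    std::vector<float> h_out(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

calling `run_branch_demo(true, 256)` and `run_branch_demo(false, 256)` from a `main` runs both variants. both do the same amount of useful work per thread, only the mapping of branch to thread changes, and a profiler is the way to check whether any divergence actually shows up.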
46/100 of GPU Grind
starting to work on an fp16 gemm kernel, playing with the __half api for now. it's all intrinsics, so it feels like i'm writing avx512 but in a cuda program. i'm setting up all the reference computations etc, and i was surprised to see the difference between fp64 flops and fp16 flops (i'm running it on a 2060 for now, going to run it on ampere ultimately to be able to use more features).
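for a taste of what the `__half` intrinsics look like, here's a small sketch. it's not the gemm kernel, just a packed fp16 axpy (`y = a*x + y`). the kernel name `haxpy2` and the host helper `haxpy_host` are made up for illustration; the intrinsics (`__float2half`, `__half2float`, `__half2half2`, `__hfma2`) come from `cuda_fp16.h`. it assumes an even number of elements.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// y[i] = a * x[i] + y[i], where each __half2 holds two fp16 values.
__global__ void haxpy2(int n2, __half a, const __half2* x, __half2* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 a2 = __half2half2(a);     // broadcast a into both halves
        y[i] = __hfma2(a2, x[i], y[i]);   // two fused multiply-adds at once
    }
}

// Hypothetical host helper: converts float inputs to fp16, runs the kernel,
// and converts the result back to float. Assumes x.size() == y.size() and
// that the size is even. Error checking omitted for brevity.
std::vector<float> haxpy_host(float a, const std::vector<float>& x,
                              const std::vector<float>& y) {
    int n = static_cast<int>(x.size());
    std::vector<__half> hx(n), hy(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = __float2half(x[i]);
        hy[i] = __float2half(y[i]);
    }

    __half* dx = nullptr;
    __half* dy = nullptr;
    cudaMalloc(&dx, n * sizeof(__half));
    cudaMalloc(&dy, n * sizeof(__half));
    cudaMemcpy(dx, hx.data(), n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(__half), cudaMemcpyHostToDevice);

    int n2 = n / 2;  // number of packed pairs
    int threads = 128;
    int blocks = (n2 + threads - 1) / threads;
    // cudaMalloc pointers are aligned well enough to be viewed as __half2.
    haxpy2<<<blocks, threads>>>(n2, __float2half(a),
                                reinterpret_cast<const __half2*>(dx),
                                reinterpret_cast<__half2*>(dy));

    cudaMemcpy(hy.data(), dy, n * sizeof(__half), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) out[i] = __half2float(hy[i]);
    return out;
}
```

half arithmetic intrinsics need compute capability 5.3 or higher, so on a 2060 this would be compiled with something like `nvcc -arch=sm_75`. the avx512 feeling is right there: one `__hfma2` works on a packed pair, just like a vector fma works on a packed register.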
the fp64 to fp16 ratio for cuBLAS gemms is 1/36, which is not even that big a gap considering the hardware peak of fp64 is 1/32 of fp32, which is itself 1/2 of fp16 (so 1/64 on paper). it's just that i forgot the chip had that few fp64 cores. the way they say it in the whitepaper is literally "we just included bare minimum fp64 cores so that fp64 program can run correctly". i knew that at some point but i forgot and was still surprised 😅
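spelling out the arithmetic, using only the numbers from this post (the efficiency symbols $e$, meaning measured throughput divided by peak, are my own notation, and this takes the fp16 peak as 2x fp32 like above):

```latex
\[
\frac{\text{peak}_{\text{fp64}}}{\text{peak}_{\text{fp16}}}
  = \frac{1}{32} \cdot \frac{1}{2} = \frac{1}{64},
\qquad
\frac{\text{measured}_{\text{fp64}}}{\text{measured}_{\text{fp16}}}
  = \frac{e_{\text{fp64}}}{e_{\text{fp16}}} \cdot \frac{1}{64} = \frac{1}{36}
\;\Longrightarrow\;
\frac{e_{\text{fp16}}}{e_{\text{fp64}}} = \frac{36}{64} \approx 0.56
\]
```

so under those assumptions the fp16 gemm gets to roughly 56% of the fraction of peak that the fp64 gemm reaches, which is why the measured gap (1/36) is smaller than the paper gap (1/64).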