So, er, the gist is: "use fp16 instead of un16/sn16 if its precision is good enough for you, and use load instead of sample if you don't need filtering".
Kinda makes sense if you consider where the optimized paths in modern GPUs might lie, but the outliers show it's not that easy :)
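
For the "if the precision is good enough" part, here's a rough way to eyeball it on the CPU. This is only a sketch, assuming un16 means a 16-bit unsigned normalized format storing values in [0, 1]; the helper names are made up for illustration, and it says nothing about GPU performance, only about rounding error.

```python
import struct


def roundtrip_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def roundtrip_unorm16(x: float) -> float:
    """Quantize x to a 16-bit unsigned normalized integer and back.

    Assumption: un16 stores round(x * 65535) for x clamped to [0, 1].
    """
    clamped = min(max(x, 0.0), 1.0)
    return round(clamped * 65535) / 65535


def max_abs_error(roundtrip, samples):
    """Largest absolute round-trip error over the given samples."""
    return max(abs(roundtrip(x) - x) for x in samples)


# Evenly spaced test values covering [0, 1].
samples = [i / 100000 for i in range(100001)]

fp16_err = max_abs_error(roundtrip_fp16, samples)
unorm16_err = max_abs_error(roundtrip_unorm16, samples)

print(f"fp16    max abs error on [0, 1]: {fp16_err:.3e}")
print(f"unorm16 max abs error on [0, 1]: {unorm16_err:.3e}")
```

On [0, 1] fp16 comes out worse near 1.0, since its 11-bit significand gives a spacing of 2^-11 there versus 1/65535 for un16, but it gets finer and finer towards zero. Whether that trade is acceptable depends on what the data looks like.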