i was deliberately flattening three things (ptx as ir/virtual isa, ptx embedded for forward compat, ptx jit vs native cubin) into "the fallback behavior". the reason: on dgx spark hardware, falling back this way (not emitting native cubin) is slower at load time. that time penalty is incurred on first load and again on any cache invalidation or miss.

fallback path: at build time the compile on the cpu goes cuda c -> ptx and stops, producing a binary that contains ptx. at load time the driver jit then has to consume that ptx to create native cubin containing sm_121-specific machine code.

native path: at build time the compile on the cpu goes cuda c -> ptx -> ptxas -> sass, producing native cubin that already contains sm_121-specific machine code. at load time it can be loaded directly for that target, skipping the ptx jit step.
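to make the two build paths concrete, here is a minimal sketch of how the nvcc invocations differ. the helper names (`gencode_args`, `nvcc_command`) and the file names are hypothetical, made up for illustration; the only real thing is nvcc's `-gencode arch=...,code=...` option, and it assumes a cuda toolkit recent enough to know `sm_121`.

```python
from typing import List


def gencode_args(cc: str, native: bool) -> List[str]:
    """Build one nvcc -gencode option for compute capability `cc` (e.g. "121").

    native=True  -> code=sm_XX: ptxas runs at build time, native cubin (sass) is embedded.
    native=False -> code=compute_XX: compilation stops at ptx, which is embedded
                    for the driver jit to finish at load time.
    """
    target = f"sm_{cc}" if native else f"compute_{cc}"
    return ["-gencode", f"arch=compute_{cc},code={target}"]


def nvcc_command(source: str, output: str, cc: str = "121",
                 native: bool = True, also_embed_ptx: bool = False) -> List[str]:
    """Assemble a hypothetical nvcc command line (as an argv list, not executed here)."""
    args = gencode_args(cc, native=native)
    if native and also_embed_ptx:
        # Native cubin for this target plus ptx kept around for forward compat.
        args += gencode_args(cc, native=False)
    return ["nvcc", *args, "-o", output, source]


if __name__ == "__main__":
    # Fallback path: ptx only, driver jit does the rest at load time.
    print(" ".join(nvcc_command("kernel.cu", "app_ptx", native=False)))
    # Native path: sm_121 cubin built ahead of time, no jit step on load.
    print(" ".join(nvcc_command("kernel.cu", "app_native", native=True)))
```

the `also_embed_ptx` switch reflects the usual compromise: ship native cubin for the target you care about and keep ptx alongside it so newer gpus still have something to jit from.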