@Apple never publicly released async device→threadgroup copies for user kernels leaving a lot of performance on the table. But it leaked briefly in Xcode 14.2 as simdgroup_event::async_copy, then got pulled before 14.3.
So with the help of claude, I tried to reverse engineer Apple's private Metal async copy API hiding inside MPSMatrix.framework on my M4 Mac.
Turns out Apple evolved the API internally. Running strings c filt on the MPS metallib reveals the current private signature -
metal::simdgroup_future<void>
metal::simdgroup_async_copy<float, void>(
float threadgroup* dst,
ulong, ulong, ulong2, // dst stride tile
float const device* src,
ulong, ulong, ulong2, // src stride tile
long2, // offset
metal::simdgroup_async_copy_clamp_mode
)
Instantiated for float, half, short, signed char.
The AIR intrinsics confirm it's real:
air.simdgroup_async_copy_2d.p3i8.p1i8
air.wait_simdgroup_events
Your M4 can do it. MPS does it. You can't.