> mini-kernel-lib now has two FP32 GEMM reference paths, fused ReLU, single-axis reductions, and a direct NCHW conv2d forward pass, plus workspace checks, tests, benchmarks, and opt-in GEMM autotuning.
> It is still reference-only; the next step is real GPU kernels and tighter dispatch.
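
For readers unfamiliar with what a "reference path" with a fused activation looks like, here is a minimal sketch. It assumes the ReLU is fused into the GEMM epilogue and that matrices are row-major; the function name `gemm_relu_ref` and its signature are hypothetical and are not taken from mini-kernel-lib.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not the library's actual API).
// Computes C = relu(A * B) with A as M x K, B as K x N, C as M x N,
// all row-major FP32. The ReLU is applied as each output element is
// written, so no second pass over C is needed.
std::vector<float> gemm_relu_ref(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::size_t m, std::size_t n, std::size_t k) {
    // Basic shape check, in the spirit of the workspace checks mentioned above.
    assert(a.size() == m * k && b.size() == k * n);

    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Fused epilogue: clamp negatives to zero while storing.
            c[i * n + j] = std::max(acc, 0.0f);
        }
    }
    return c;
}
```

A naive triple loop like this is slow but easy to verify, which is why it is useful as a baseline for checking future GPU kernels.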