Few weeks ago
@anirudhbv_ce shipped TurboQuant in cuTile on a Blackwell B200. Beautiful work - genuinely.
I wanted to see what the same algorithm looks like at the opposite end of the stack: raw CUDA hand-written PTX on a 4GB laptop GPU. No cuTile. No B200. Just nvcc, shared memory, and a lot of nsight-compute.
Quick context — TurboQuant paper proposed compression in KV-cache and vector-search embedding way harder than anything before it, which I have discussed earlier in my blog , please go through it for clearer understanding of the algorithm :
yuvanesh.vercel.app/blogs/Tu…
I implemented it three ways and bench marked them against each other:
1. Vanilla CUDA — two kernels. Rotate (FWHT) in kernel 1, quantize bit-pack in kernel 2. Clean, but the intermediate rotated tensor gets written out to HBM and read back , 64 MB of wasted memory traffic per call at N=65k, d=128. Two launches too.
2. Fused — one kernel. Rotated vector stays in shared memory, quantize reads it from there directly. HBM round-trip gone. 1.10–1.25× faster on quantize, 1.17–1.41× on dequant. Dequant benefits more because its per-coord compute is smaller, so saving the HBM trip is a bigger fraction of runtime.
Hit a fun bug here. The fused kernel agreed bit-for-bit with the unfused version at b=1 and b=2, but differed in exactly 1 coord out of ~4 million at b=4. Cause: --use_fast_math was fusing (smem * scale) - codebook[k] into a single FMA, which rounds once instead of twice. At midpoint ties, that's enough to flip which centroid wins. Fix: pin the scale multiply with __fmul_rn. Bit-exact parity restored.
(cc
@tri_dao — the __fmul_rn / FMA rounding thing felt like exactly the kind of footgun you run into in FlashAttention territory. Curious whether you pin rounding explicitly at ops that matter or just test against a tolerance.)
3. Fused inline PTX. Two experiments. One paid off massively, one did nothing:
pack_signs with warp ballot (vote.sync.ballot.b32) — ~2.0× across every config. 32 threads each contribute one bit in unison via a warp-level primitive. No clean C form.
bfi.b32 for bit-packing the quantized indices — zero speedup within noise. I checked the SASS and nvcc already emits BFI from the C pattern word |= (idx & MASK) << shift. The inline PTX was cosmetic.
Takeaway: inline PTX only pays off when it exposes a hardware primitive C can't express.
End-to-end on SIFT-1M (1M × 128 vectors, standard ANN benchmark): — 93% Recall@10 at 8× compression with fp32 rerank — Naive scalar quantization at same bits: 68% — At 16× compression, naive is essentially random (6%); TurboQuant still preserves 52% of true top-10
(
@vikhyatk @StasBekman — tried to be careful separating the three "baselines" here: fp32 ceiling vs fp16 cast vs naive scalar at matched bits. If anyone spots methodology holes, I'd take the feedback.)
Paper's predicted distortion for b={1,2,4}: {0.36, 0.117, 0.009}. I measured {0.361, 0.116, 0.00933} over 10 seeds. Sits right on the predicted curve.
Full repo, all three implementations, reproducible benchmarks: →
github.com/YuvaneshSankar/CU…
@mirrokni