1/
We've spent years shrinking the KV cache via head-sharing (GQA/MQA), but we never questioned a more fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place?
Turns out, they don't. Merging them unlocks massive memory savings. 🧵