Great work from GoogleResearch on TurboQuant. Strong results — 3-bit KV cache quantization, 8× attention speedup, zero accuracy loss. Solid theoretical foundations.
Worth noting the distinction: quantization optimizes what happens inside the model. .serva operates at the data layer — before the model ever sees the input.
.serva is universal and lossless. When downstream tasks are unknown — which they often are in general AI pipelines — you cannot know in advance what information will matter. We preserve everything and defer relevance to the learning system.
We're also operating at a different layer entirely: ~44× speedup at the data layer in fine-tuning. We’ve built across any model, at any stage — pretraining, fine-tuning, inference — with no retraining required.
The efficiency stack is being built from multiple directions at once. That's a good sign for the field.