The biggest reduction in KV cache memory comes not from quantization or MLA, but from latent compaction along the sequence dimension.
More strong results are coming soon with Attention Matching.
We introduce a new approach for fast and high-quality context compaction in latent space.
Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
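
As a rough back-of-the-envelope illustration of why the sequence dimension is the larger lever, the sketch below compares the KV cache size of a 16-bit baseline, a 4-bit quantized cache, and a cache with 50× fewer cached positions. The model shape, the context length, and the 4-bit comparison point are assumptions chosen for illustration; only the 50× ratio comes from the text above, and the code does not implement Attention Matching itself.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence (factor 2 = K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and context length, not taken from the source.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 100_000

# 16-bit cache over the full sequence.
baseline = kv_cache_bytes(seq_len, bytes_per_elem=2, **cfg)
# 4-bit cache over the full sequence (assumed; ignores scale/zero-point overhead).
quantized = kv_cache_bytes(seq_len, bytes_per_elem=0.5, **cfg)
# 16-bit cache over 50x fewer positions (the compaction ratio quoted above).
compacted = kv_cache_bytes(seq_len // 50, bytes_per_elem=2, **cfg)

for name, size in [("baseline", baseline), ("4-bit", quantized), ("50x compacted", compacted)]:
    print(f"{name:>14}: {size / 2**30:6.2f} GiB  ({baseline / size:.0f}x smaller)")
```

Quantization can only shrink the bytes per element, which bounds its savings by the starting precision, whereas compaction shrinks the number of cached positions, and the two reductions multiply if combined.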