I think this also puts the battle between archs back on
MiMo using a 1:7 (or 1:6, can’t remember) ratio with SWA is iiuc a similar KV cache saving to DeepSeek v4 (not considering the attn hparams, and assuming DSA on the full layers works well)
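Rough back-of-envelope for the SWA ratio point, just a sketch. Everything here is an assumption for illustration: I'm reading 1:7 as 1 full-attention layer per 7 SWA layers, and the layer count, window size, head config and dtype are made-up numbers, not the actual MiMo or DeepSeek hparams.

```python
def kv_bytes_per_sequence(seq_len, n_layers, swa_per_full, window,
                          n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache bytes for one sequence in a hybrid SWA / full-attention stack.

    All defaults are hypothetical. swa_per_full=7 means 1 full-attention
    layer for every 7 sliding-window layers; swa_per_full=0 means all full.
    """
    n_full = n_layers // (swa_per_full + 1)
    n_swa = n_layers - n_full
    # K and V per token per layer.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    # Full layers keep every token; SWA layers keep at most `window` tokens.
    full_bytes = n_full * seq_len * per_token
    swa_bytes = n_swa * min(seq_len, window) * per_token
    return full_bytes + swa_bytes


# Hypothetical config: 48 layers, 128k context, 4k window.
all_full = kv_bytes_per_sequence(131072, 48, swa_per_full=0, window=4096)
hybrid = kv_bytes_per_sequence(131072, 48, swa_per_full=7, window=4096)
print(f"{all_full / hybrid:.2f}x smaller")  # 6.56x smaller
```

The saving approaches 8x as the context gets much longer than the window, since the SWA layers stop growing. This says nothing about the extra saving from the attn hparams on the full layers.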
Also with KDA, could u push even higher ratios (although tbh idk how prefix caching works with KDA but guessing it works well)?
Question I guess is if ur getting better or worse model performance compared to the DeepSeek hybrid arch
As an update, after convos with many others: it does seem like the KV cache reduction is mainly to help prefix caching hit rates and to cut P/D KV transfer overhead
The improvements here are mainly for deployment and not for RL efficiency
Tbh I just didn’t know prefix caching and hierarchical KV cache offloading were that much of a bottleneck, but it does seem quite important here tbf (a smaller cache means u can store more in the higher tiers and evict less)
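To make the prefix caching and P/D transfer point concrete, here's a minimal sketch. The numbers are all hypothetical (24 GiB vs 4 GiB of KV per long prefix, a 512 GiB DRAM tier, a 50 GB/s link) and the helper names are mine, not from any serving framework.

```python
GIB = 2**30


def prefixes_that_fit(tier_bytes, kv_bytes_per_prefix):
    """How many cached prefixes one offload tier can hold before evicting."""
    return tier_bytes // kv_bytes_per_prefix


def pd_transfer_seconds(kv_bytes, link_gbytes_per_s):
    """Time to ship one request's KV cache from prefill to decode workers.

    Ignores latency, overlap and compression; bandwidth-only estimate.
    """
    return kv_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical: same 512 GiB DRAM tier, big vs small KV per prefix.
print(prefixes_that_fit(512 * GIB, 24 * GIB))  # 21
print(prefixes_that_fit(512 * GIB, 4 * GIB))   # 128

# Hypothetical 50 GB/s link between prefill and decode nodes.
print(round(pd_transfer_seconds(24 * GIB, 50), 3))  # 0.515
print(round(pd_transfer_seconds(4 * GIB, 50), 3))   # 0.086
```

So with these made-up numbers a 6x smaller cache means roughly 6x more prefixes resident per tier (fewer evictions, so better hit rates) and roughly 6x less transfer time per request.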
Also for prefill/training I kinda assumed DSA didn’t help much as the sparsity is dynamic, but I’m pretty sure I’m wrong here (tbf I’ve never seen DSA training or prefill speed comparisons, but I haven’t looked much)
So the decisions do make sense for these reasons, although I still think it’s an interesting change that KV cache size is now a bottleneck for prefix caching and P/D transfer rather than for decoding as before, so there might be more potential to do stuff here