Previously, in our work on token merging/pruning methods such as VisionZip, FastV, LLaVA-PruMerge, and ToMe, we always tried to drop or merge redundant tokens.
LLaVA-OneVision-2.0 takes a much more native route: it directly leverages video codec knowledge to treat highly dynamic video as a continuous bit-cost stream. Surprisingly, it handles redundant information better and delivers stronger results.
This feels like a cleaner, more fundamental way forward. Really nice shift! 🔥