The wildest CVPR 2026 result: a video frame doesn’t need 1,024 tokens. It needs one.
“A Frame is Worth One Token” (DeltaWorld) compresses each frame to a single token for world modeling.
- Better future predictions with over 35x fewer parameters and 2,000x fewer FLOPs than existing generative world models, plus a 1,024x token reduction at 512x512.
- A tokenizer encodes the difference between the DINOv3 features of consecutive frames into one “delta” token. A tiny generator makes several guesses at the next delta token, and only the guess closest to ground truth is supervised. The payoff: diverse futures in a single pass.
- Why it matters: Video collapses from a 3D blob into a 1D sequence. Generative world models finally get cheap enough to actually run.
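
To make the numbers and the training trick concrete, here is a minimal sketch, not the paper's code. It assumes a 16x16 patch grid (which would account for the 1,024x figure at 512x512) and models "supervise only the closest guess" as a winner-takes-all squared-error loss. The function names, the patch size, and the choice of squared error are all illustrative assumptions.

```python
def patch_tokens(height, width, patch=16):
    """Patch tokens per frame. patch=16 is an assumption, not from the post."""
    return (height // patch) * (width // patch)


def closest_guess_loss(candidates, target):
    """Winner-takes-all: return the index and squared error of the
    candidate nearest to the target. Only that candidate would be supervised."""
    def sq_err(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, target))

    errors = [sq_err(c) for c in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# 512x512 frame on a 16x16 patch grid: 32 * 32 patch tokens versus 1 delta token.
tokens = patch_tokens(512, 512)

# Three hypothetical guesses for the next delta token (toy 2-D vectors).
candidates = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]]
target = [1.0, 1.0]
best, loss = closest_guess_loss(candidates, target)
```

Because the other guesses are never penalized, they are free to cover different plausible futures, which is one common reading of how a single pass can yield diverse predictions.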