Humans and animals reason about events spanning days, weeks, and years, yet current CV systems live largely in the present.
Introducing Memory-Consolidated ViT, which extends its context far into the past and sets a new SOTA in long-video understanding with a 10x smaller model.