So many great releases/papers today!
Can a vision model learn to see with no augmentations, no masking, no cropping, no reconstruction?
🎬 It can! Introducing Temporal Difference in Vision (TDV), a new visual representation learning paradigm built on a single assumption: the past causes the future.
TL;DR :
- We introduce TDV, the first approach to learn useful representations without any augmentations, masking, cropping or pixel based reconstruction.
- TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
- We also show that as data scales up, weaker assumptions work better.
🧵Thread: