🎬 Toward a Cambrian Moment for Visual Intelligence.
Can AI learn complex real-world skills from video alone, just like humans learn paper folding or LEGO by watching?
🚀 ByteDance Seed and BJTU present VideoWorld 2, a simple generative model that masters complex, long-horizon real-world knowledge purely from visual data, without relying on language models.
👩‍🏫 As also noted by @drfeifei, the emergence of visual capability sparked the Cambrian Explosion and enabled a rapid leap in intelligence. VideoWorld 2 explores this frontier by learning complex task knowledge directly from real-world videos. It reliably executes minute-long handcraft tasks such as paper folding and block building, where current SOTA video generation models (e.g., Sora2, Veo3, Wan2.2) fail, improving success rates by over 70%. It also demonstrates cross-domain transfer and strong scaling in robotics.
🌟 Continuing the vision of the #VideoWorld series, we believe that visual learning offers a scalable path toward agents that acquire knowledge the way humans do: by observing the world directly.
Our main contributions are:
👉 We are the first to explore learning complex, long-horizon task knowledge directly from unlabeled real-world videos, and we identify the disentanglement of visual appearance from task-critical dynamics as key to transferable skill acquisition.
👉 We propose VideoWorld 2, leveraging a dynamics-enhanced Latent Dynamics Model (dLDM) to extract task-critical dynamics, significantly outperforming SOTA video generation models on complex real-world tasks.
👉 We construct Video-CraftBench, a large-scale handcraft video benchmark to advance research in visual knowledge learning and world modeling.
Check out our paper for more details and results!
[Project Page]: VideoWorld2.github.io/
[arXiv]: arxiv.org/abs/2602.10102
[Code]: github.com/ByteDance-Seed/Vi…
#VideoWorld2 #VideoWorld #VisionLearning #Robotics #EmbodiedAI