Completely agree with the thoughts here. Every major ML sector has been solved via a well-developed data flywheel, and embodied AI just doesn’t have one yet. Every hour of robotics teleop/UMI data had to be paid for and worked for, so we’ll never reach Common Crawl scale this way.
If you’ve been paying close attention, you’ll notice a lot of the people in world models are coming from AV… it’s because the AV flywheel has gotten so good that the problem is essentially solved internally.
Also, the arch points here are undoubtedly true. Language-based backbones have no place in action policies. Language is a hyper-compressed representation of reality, and there is no way to truly understand the dynamics of the world through language alone.
undoubtedly, world models > VLAs
here’s why world models are winning and what it means if you’re building in robotics:
1. VLAs worked around the robotics data gap by bolting robot actions onto a vision-language model. robot data is catching up. there’s no problem to work around anymore.
2. world models understand physics, not just pixels. space, motion, causality, affordances. a VLA sees an image and predicts an action. a world model simulates what happens next and plans through it.
3. the data flywheel finally makes sense. the robot collects data, the model gets better, the robot gets better, repeat. 1X, Generalist, and π0.7 are all converging on this loop.
4. you don’t need millions of hours to start. 1X trained on 900 hours of human video and 70 hours of robot data. architecture and data quality beat volume.
5. for robotics founders: fine-tuning an off-the-shelf model is fast but it’s a ceiling. the teams training from scratch on their own data are compounding. that gap only grows.
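
to make point 2 concrete, here’s a toy sketch of "predict an action" vs "simulate what happens next and plan through it". everything in it is an assumption for illustration: `world_model` is hand-written 1-D physics standing in for a learned dynamics model, `reactive_policy` is a deliberately crude caricature of a direct observation-to-action mapping (not how a real VLA works), and the cost function and horizon are arbitrary.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # toy action space: brake, coast, accelerate
DT = 0.1                    # simulation step (arbitrary)


def world_model(state, action):
    """Hypothetical stand-in for a learned dynamics model: predict the next state."""
    pos, vel = state
    vel = vel + action * DT
    pos = pos + vel * DT
    return (pos, vel)


def reactive_policy(state, goal):
    """Crude caricature of a direct mapping: look at the current position, pick an action."""
    pos, _ = state
    return 1.0 if goal > pos else -1.0


def plan(state, goal, horizon=4):
    """Roll every action sequence through the model and return the first action
    of the sequence whose predicted end state is closest to the goal and slowest."""
    best_cost, best_first = float("inf"), 0.0
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)  # imagined rollout, no real-world step taken
        cost = abs(s[0] - goal) + 0.1 * abs(s[1])
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


state = (0.9, 1.0)  # just short of the goal at 1.0, but moving fast toward it
print(reactive_policy(state, 1.0))  # 1.0: still pushing forward
print(plan(state, 1.0))             # -1.0: the rollout predicts overshoot, so it brakes
```

the reactive rule only sees "goal is ahead", so it keeps accelerating. the planner brakes because every imagined rollout overshoots, and braking now overshoots the least. exhaustive search over 3^4 sequences only works at toy scale; a real system would need a sampled or learned planner.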