The bottleneck of frontier robotics isn’t compute, labeling, or the models themselves.
It’s data collection.
While language models scaled effortlessly on open internet text, robotics requires physical trajectories, motor torques, and tactile forces that cannot simply be scraped from a webpage.
Every token has to be fought for.
Here is a breakdown of the 7 data types shaping the industry today, each representing a trade-off between collection cost and action-label purity:
1. Real Teleoperation (AgiBot World, DROID). Humans guide the hardware directly, so every trajectory comes with real robot actions, but collection scales only linearly with human hours.
2. Low-cost Capture (Mobile ALOHA, UMI handheld). It drives collection cost down while keeping real physics, though handheld rigs like UMI introduce an embodiment mapping problem when human hand motion has to be transferred to robot joints.
3. Fleet / Deployment Data (Tesla Optimus, Figure). These are trajectories from robots already working in the field. Tesla is betting its automotive fleet infrastructure transfers to Optimus. It generates powerful, real edge cases, but requires scaled deployment.
4. Simulation (NVIDIA Isaac Sim, Genesis). It offers near-infinite scale, but simulators still struggle to model contact-rich dynamics like slipping, twisting, and friction, which keeps the sim-to-real gap open.
5. World-Model Synthetic (NVIDIA Cosmos 3). NVIDIA just shipped Cosmos 3, which natively outputs action trajectories, not only video pixels. If a world model can simulate physics accurately, it drastically reduces the need for manual teleop data.
6. Egocentric video (Ego4D, Meta’s Project Aria). First-person human video captured with head-mounted rigs. Far more scalable than teleop and closer to a robot’s own viewpoint. Still carries no robot action signals on its own.
7. Internet video (YouTube, TikTok). Maximum scale, lowest cost, effectively free. It captures the widest range of objects, tasks, and physical situations, but with zero action labels and (mostly) a third-person viewpoint.
Collecting data is only step one.
The next great execution challenge is engineering a coherent training recipe that can blend these heterogeneous data sources into a single model.
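To make "blending" concrete, here is a minimal sketch of one common ingredient: weighted sampling across sources, plus a mask that turns off the action loss for samples with no action labels. Everything in it is an assumption for illustration. The `Source` class, the source names, and the weights are hypothetical, not values from any published recipe, and treating low-cost capture as action-labeled assumes the embodiment mapping has already been applied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    weight: float      # hypothetical sampling weight, not a published value
    has_actions: bool  # whether samples carry usable robot action labels


# The seven data types from the list above, with made-up mixture weights.
SOURCES = [
    Source("teleop", 0.30, True),
    Source("low_cost_capture", 0.20, True),  # assumes actions were already mapped to the robot
    Source("fleet", 0.10, True),
    Source("simulation", 0.15, True),
    Source("world_model_synthetic", 0.10, True),
    Source("egocentric_video", 0.10, False),
    Source("internet_video", 0.05, False),
]


def sample_batch(sources, batch_size, rng):
    """Pick which source each sample in the batch comes from, in proportion to weight."""
    weights = [s.weight for s in sources]
    return rng.choices(sources, weights=weights, k=batch_size)


def action_loss_mask(batch):
    """1.0 where an action loss can be computed, 0.0 for video-only samples."""
    return [1.0 if s.has_actions else 0.0 for s in batch]


rng = random.Random(0)  # fixed seed so the run is repeatable
batch = sample_batch(SOURCES, 8, rng)
mask = action_loss_mask(batch)
print([s.name for s in batch])
print(mask)
```

In a setup like this, video-only samples would still feed a visual or prediction objective while the mask keeps them out of the action loss. Choosing the weights is the hard part, and this sketch does not attempt to solve it.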