The data Physical AI needs is changing.
For years, the focus was on locomotion and basic manipulation: walking, running, grasping. With the rise of VLA (Vision-Language-Action) models, the bar has shifted. Robots now need to see, reason about process, and predict outcomes. An action means little without its context: why it was taken, where the eyes went, and the intent behind it.
ZenO doesn't stop at collecting first-person video. When footage comes in, ZenO Studio reconstructs the head trajectory as 6-DoF pose, computes its spatial relationship to hand poses, and converts the result into a format a robot can actually learn from.
Output ships in the LeRobot v2.1 dataset format, drop-in compatible with Physical Intelligence π0, ALOHA ACT, and similar pipelines.
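To make the head-to-hand relationship described above concrete, here is a minimal sketch of one common way to express a hand pose in the head's coordinate frame. It is not ZenO Studio's implementation: the function names, the (x, y, z, w) quaternion ordering, and the world-frame convention are all assumptions chosen for illustration.

```python
import numpy as np

def pose_to_matrix(position, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position and a quaternion.

    Assumption: the quaternion is ordered (x, y, z, w) and maps
    body-frame vectors into the world frame.
    """
    q = np.asarray(quat_xyzw, dtype=float)
    x, y, z, w = q / np.linalg.norm(q)  # normalize to guard against drift
    rotation = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = np.asarray(position, dtype=float)
    return transform

def hand_in_head_frame(world_from_head, world_from_hand):
    """Express the hand pose relative to the head: inv(T_head) @ T_hand."""
    return np.linalg.inv(world_from_head) @ world_from_hand

# Hypothetical single frame: head 1.6 m up, turned 90 degrees about the vertical axis.
half_angle = np.pi / 4
head = pose_to_matrix([0.0, 0.0, 1.6], [0.0, 0.0, np.sin(half_angle), np.cos(half_angle)])
# Hypothetical hand pose in the same world frame, with no rotation.
hand = pose_to_matrix([0.0, 0.5, 1.2], [0.0, 0.0, 0.0, 1.0])

relative = hand_in_head_frame(head, hand)
hand_position_in_head = relative[:3, 3]  # approximately (0.5, 0.0, -0.4)
```

Expressing the hand relative to the head, rather than in a fixed world frame, keeps the signal tied to where the person was looking, which is the kind of gaze-to-hand context the text argues a robot needs.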
Teleoperation produces clean action data, but it can't capture the environmental variability and failure modes a humanoid will actually encounter. That's why we build datasets from everyday human activity. Five minutes of someone cooking becomes a dataset where gaze, hands, and intent are baked in.
Through our app, ZenO is building infrastructure where anyone with a smartphone can produce high-quality training data. Training data for the humanoid era no longer belongs only to expensive labs.
Physical AI learns by watching us. We turn how we watch into data.