The benchmark includes comprehensive evaluations of GPT-5, OpenVLA, Pi0, Magma, and other leading models, with open-source adaptations that enable testing on tasks far outside their original design.
Results show that even the most advanced models struggle with true generalization.