VLA or world model?
Standard Vision-Language-Action (VLA) models map observations (images, language instructions, and robot state) directly to robot actions, end to end. Recently, there has been much discussion of world modeling as a paradigm shift away from VLA-based approaches in robot learning. Instead of learning how one particular robot should move, a world model learns how the world works, which makes that knowledge a shared asset across robot bodies and could unlock better generalization across environments and scenarios.
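To make the contrast concrete, here is a minimal toy sketch of the two interfaces. It is an illustration under heavy assumptions, not how any real VLA or world model is implemented: the world is a single number (a position), the "learned" functions are hand-written stand-ins, and every name (`vla_policy`, `world_model`, `plan`, the two action sets) is hypothetical.

```python
from typing import Callable, Sequence

# Toy 1-D world (an assumption for illustration): the state is a position
# and an action is a displacement.
State = float
Action = float


def vla_policy(observation: State, goal: State) -> Action:
    """VLA-style interface: observation (plus instruction) in, action out."""
    # Stand-in for a learned network. The +/-1.0 step limit is baked in,
    # so this policy is tied to one specific body.
    return max(-1.0, min(1.0, goal - observation))


def world_model(state: State, action: Action) -> State:
    """World-model interface: predict the next state for a given action."""
    # Stand-in for learned dynamics; here it is simply the toy dynamics.
    return state + action


def plan(
    model: Callable[[State, Action], State],
    state: State,
    goal: State,
    candidate_actions: Sequence[Action],
) -> Action:
    """Pick the candidate whose predicted outcome lands closest to the goal."""
    return min(candidate_actions, key=lambda a: abs(model(state, a) - goal))


# Two hypothetical bodies with different action sets reuse the same model.
small_robot_actions = [-0.5, 0.0, 0.5]
large_robot_actions = [-2.0, 0.0, 2.0]

print(vla_policy(0.0, 2.0))                              # 1.0
print(plan(world_model, 0.0, 2.0, small_robot_actions))  # 0.5
print(plan(world_model, 0.0, 2.0, large_robot_actions))  # 2.0
```

In this sketch the policy encodes one body's action limits directly, while the dynamics function says nothing about any body; only the candidate action set passed to `plan` changes between robots. That is the sense in which knowledge of the world could be shared across embodiments.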