3/n
VILA learns compact, view-invariant latent actions that capture state transitions rather than a representation of the whole scene.
By focusing on action-relevant dynamics, VILA avoids spending encoder capacity on modeling the entire scene, leading to more robust policy learning.