Most of the manipulation fragility I see in the wild traces back to one mismatch: the vision encoder was trained on static images, the policy was trained on motion, and nobody on the stack is in charge of the verb.
DynaFLIP picks that exact fight.
Instead of bolting a CLIP/SigLIP/DINOv2-style backbone in front of an MLP, a Diffusion Policy, or a VLA and asking the policy to learn the dynamics from demonstrations alone, it trains the encoder itself to anticipate motion. Three signals get aligned on a shared sphere: image transitions, the language instruction, and 3D scene flow. The objective shrinks the triangle they span. A cosine regularizer keeps the triangle from cheating itself flat. InfoNCE negatives keep the three embeddings from collapsing into one point.
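To make those three terms concrete, here is a minimal NumPy sketch of how a triangle-area term, a cosine regularizer, and InfoNCE could be combined. It is an illustration under stated assumptions, not DynaFLIP's published loss: the function names, the weight `lam`, the temperature `tau`, the equal weighting of the three InfoNCE pairs, and the exact form of the cosine term are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit sphere.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triangle_area(a, b, c):
    # Area of the triangle with vertices a, b, c, via the Gram determinant
    # of the two edge vectors (works in any embedding dimension).
    u, v = b - a, c - a
    gram = (u * u).sum(-1) * (v * v).sum(-1) - (u * v).sum(-1) ** 2
    return 0.5 * np.sqrt(np.clip(gram, 0.0, None))

def cosine_penalty(a, b, c):
    # Assumed form of the regularizer: 1 minus the mean pairwise cosine,
    # so a thin but stretched triangle is still penalized.
    cos = (a * b).sum(-1) + (b * c).sum(-1) + (a * c).sum(-1)
    return 1.0 - cos / 3.0

def info_nce(q, k, tau=0.1):
    # Row i of q should match row i of k; other rows in the batch are negatives.
    logits = q @ k.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def tri_modal_loss(img, txt, flow, lam=0.1, tau=0.1):
    # img, txt, flow: (batch, dim) embeddings of the image transition,
    # the language instruction, and the 3D scene flow.
    a, b, c = map(l2_normalize, (img, txt, flow))
    area = triangle_area(a, b, c).mean()
    cos = cosine_penalty(a, b, c).mean()
    nce = (info_nce(a, b, tau) + info_nce(a, c, tau) + info_nce(b, c, tau)) / 3.0
    return area + lam * cos + nce
```

The area and cosine terms pull each matched triple together; the InfoNCE terms push mismatched rows in the batch apart, which is what stops every embedding from landing on the same point.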
At deployment, the model wants a single RGB frame. The flow and language are gone. The anticipation is baked into the features.
The detail that keeps me thinking, given the egocentric-data-collection thread I’ve been pulling on (hat-cams, UMI rigs, haptic gloves): the training data is action-free video, robot and human. The supervision rides on watching the world change, so human video walks in with no action labels required. That is a different lever than collecting more teleop.
Not a VLA. It picks no actions. It is the eye the action picker looks through, with a sliver of world-model instinct smuggled into the backbone.
My article here ⬇️