😋 From WAM to WPAM: World-Action Models should NOT stop at pixels!
We release PointAction, lifting World Models from RGB to RGB XYZ and using dynamic pointmaps as universal action representations for robot control.
Page: oriontmt.github.io/pointacti…
Paper: arxiv.org/abs/2606.03943
🧐 Q: Why not pixels only?
Pixels tell us what changes, but not always how a robot should move in 3D. Learning this mapping from RGB alone often requires massive paired action data, while raw motor commands are embodiment-specific and less transferable across robots.
Our intuition: World-Action Models should model the physical world in the same space where actions take effect — and that space is 3D.
Instead of predicting raw motor commands directly from video, PointAction uses 3D point dynamics as a richer and more robust bridge 🌉:
- they make metric motion, spatial constraints, and contact-relevant geometry explicit;
- they are less tied to a specific robot’s motors;
- they can be extracted from much broader robot video data.
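To make "RGB XYZ" concrete, here is a minimal sketch of one way a pointmap could be built. It assumes a pinhole camera with known intrinsics and a per-pixel metric depth map; the thread does not say how PointAction obtains its pointmaps, so the function names and the choice of the camera frame (rather than a robot-centric frame) are illustrative assumptions only.

```python
import numpy as np

def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) XYZ pointmap.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    Points are expressed in the camera frame.
    """
    h, w = depth.shape
    # u is the column (x) pixel index, v is the row (y) pixel index.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def make_rgbxyz(rgb, depth, fx, fy, cx, cy):
    """Concatenate an (H, W, 3) RGB frame with its pointmap into (H, W, 6)."""
    xyz = depth_to_pointmap(depth, fx, fy, cx, cy)
    return np.concatenate(
        [rgb.astype(np.float32), xyz.astype(np.float32)], axis=-1
    )
```

Each pixel now carries a metric 3D position alongside its color, which is what lets motion be measured in meters rather than in pixels.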
PointAction first learns a general diffusion-based 4D world-action backbone in RGB XYZ space, predicting robot-centric 3D point dynamics, then decodes them into embodiment-specific controls with lightweight action heads.
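The thread does not describe the action heads, which are presumably learned. As a non-learned stand-in that shows why 3D point dynamics already carry action information, the sketch below recovers a rigid motion from corresponding 3D points with the Kabsch algorithm. It assumes the tracked points sit on a rigid body such as a gripper and that point correspondences between the two time steps are known; `rigid_transform_from_points` is a hypothetical helper, not part of PointAction.

```python
import numpy as np

def rigid_transform_from_points(src, dst):
    """Least-squares R (3x3) and t (3,) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points (Kabsch algorithm).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Given predicted point positions at two time steps, `R` and `t` describe an end-effector pose change that a robot-specific controller could then track. A learned head can go further and handle non-rigid or noisy point predictions.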
---
This project is led by @TongMutianTMT (incoming PhD student at @PennCIS @GRASPlab), and @hanjiang00 (talented undergrad who visited my lab last year). Huge congrats to all coauthors @WindStyle1459 and @LingjieLiu1! 🎉
🚀 Excited to share PointAction:
A new Video-Point-Action Model that uses dynamic 3D pointmaps as a universal, geometry-grounded action representation for robot control.
VLA → VAM → ? Lift RGB to RGB XYZ, then decode robot-specific actions.
arxiv.org/abs/2606.03943
[1/6]