What happens if you give your everyday ChatGPT:
a camera, a robot arm, and a simulation where it can go nuts?
Not exactly ChatGPT, but SIMPACT asks a very similar robotics question.
The current bet is VLAs: models that turn images and instructions directly into robot actions.
They look strong, until the task needs physical judgment.
Push this carton forward, but don’t topple it. Shape this rope into a U.
These are not language problems. They are data or contact problems.
SIMPACT takes an off-the-shelf VLM and gives it test-time physical reasoning.
From one RGB-D image, it builds a rough physics simulator.
Then it samples action plans, rolls them out, and lets the VLM revise before the real robot moves.
There is no training at all!
If the carton falls in sim, the model can reason:
push lower, change contact point, try again.
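To make that loop concrete, here is a toy sketch of "rehearse in sim, revise, retry". It is not SIMPACT's code. Every name here (`Carton`, `rollout`, `revise`, `plan`) is hypothetical, the "simulator" is a one-line quasi-static tipping check rather than a physics engine built from an RGB-D image, and the VLM is replaced by a hard-coded "push lower" rule. It also revises a single plan instead of sampling several.

```python
from dataclasses import dataclass


@dataclass
class Carton:
    width: float     # metres, measured along the push direction
    friction: float  # Coulomb friction coefficient with the table


def rollout(carton: Carton, push_height: float) -> str:
    """Toy stand-in for a simulated rollout (assumption, not a real simulator).

    Quasi-static model: sliding needs a force of mu*m*g, and that force
    tips the box about its leading edge when F*h > m*g*(w/2),
    i.e. when h > w / (2*mu).
    """
    tip_limit = carton.width / (2 * carton.friction)
    return "toppled" if push_height > tip_limit else "slid"


def revise(push_height: float, outcome: str) -> float:
    """Stand-in for the VLM's revision step: after a topple, push lower."""
    return push_height * 0.5 if outcome == "toppled" else push_height


def plan(carton: Carton, initial_height: float, max_rounds: int = 5):
    """Rehearse in the toy sim until a push slides the carton or rounds run out."""
    height = initial_height
    history = []
    for _ in range(max_rounds):
        outcome = rollout(carton, height)
        history.append((height, outcome))
        if outcome == "slid":
            return height, history  # safe to hand to the real robot
        height = revise(height, outcome)
    return None, history  # no safe push found within the budget


if __name__ == "__main__":
    carton = Carton(width=0.1, friction=0.5)
    print(plan(carton, initial_height=0.25))
```

With these made-up numbers the first two pushes topple the carton in the toy sim and the third, at a lower contact height, slides it. The point is only the shape of the loop: the failure happens in rehearsal, not on the real robot.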
That is the interesting shift.
Not more data, and not another policy trained on more demos.
Inference compute spent on physical rehearsal.
Real robot intelligence may need test-time compute, not just bigger models.
Which, to some extent, this paper backs up.
On 7 fine-grained manipulation tasks, SIMPACT reports 40-90% zero-shot success.
π0.5, one of those VLAs everyone has been obsessing over lately, gets 0% on all of them.
The limits are obvious: single-view 3D reconstruction is fragile, sim-to-real is never free, and the planning loop is slow.
Robots may need less “answer instantly” and more “simulate before you touch the world.”
Congrats on the paper, @ShaoxiongYao, @HaonanChen_, @WinstonGu_ and the rest of the team.
Excited to share our CVPR work: SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models, 11:45 AM – 1:45 PM at ExHall F 611
simpact-bot.github.io/
How can we make VLMs plan robotic manipulation actions with grounded physical reasoning?