Today's humanoid robots are beasts in controlled environments.
You can see that in our posts. Send one out to buy groceries and it falls apart.
It can't reliably climb your stairs, cross a cracked sidewalk, or hold a line going downhill.
Most of that gap isn't the hardware. It's the control policy.
These policies are reactive: the robot maps what it sees right now straight to an action.
No model of what its own body is about to do next.
ParkourFormer goes straight at this. The idea in plain terms:
before acting, it predicts its own body state a couple of frames ahead.
Then it picks the action based on that forecast.
Reaction turns into anticipation.
A small prediction head forecasts the next two proprioceptive states: joint angles, velocities, balance.
Those predictions get fed directly into the action head.
The robot moves based on where it expects to land next.
Not just what's under it right now.
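A rough sketch of that predict-then-act wiring, in plain Python.
This is not ParkourFormer's actual code. The class name, layer sizes, and single linear layers are placeholders for illustration.
Feeding the raw observation into the action head alongside the forecast is also an assumption.

```python
import random


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class PredictThenActPolicy:
    """Hypothetical toy policy: forecast future body states, then choose an action."""

    def __init__(self, obs_dim, state_dim, action_dim, horizon=2, seed=0):
        rng = random.Random(seed)
        self.state_dim = state_dim
        self.horizon = horizon
        # Prediction head: observation -> `horizon` future proprioceptive states (flattened).
        self.pred_w, self.pred_b = init_layer(horizon * state_dim, obs_dim, rng)
        # Action head: takes the observation plus the forecast (assumed wiring).
        self.act_w, self.act_b = init_layer(action_dim, obs_dim + horizon * state_dim, rng)

    def forward(self, obs):
        forecast = linear(self.pred_w, self.pred_b, obs)
        # The forecast is concatenated onto the input of the action head.
        action = linear(self.act_w, self.act_b, obs + forecast)
        # Split the flat forecast back into one state vector per future step.
        future_states = [
            forecast[i * self.state_dim:(i + 1) * self.state_dim]
            for i in range(self.horizon)
        ]
        return action, future_states


# 29 action outputs to match the 29 joints; obs_dim and state_dim are arbitrary here.
policy = PredictThenActPolicy(obs_dim=8, state_dim=4, action_dim=29)
action, future_states = policy.forward([0.1] * 8)
print(len(action), len(future_states))  # 29 2
```

The point of the sketch is the data flow: the action is a function of the forecast, not only of the current observation.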
How much does this matter?
Remove the supervised future-prediction loss in training and descending-stairs success collapses from 95% to under 10%.
When you can't see the ground under your next step, anticipation is the whole game.
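What that ablation might look like in code, as a minimal sketch.
Assumptions: the prediction term is a mean squared error against the states the robot actually reached, and it is added to the action objective with a weight.
The names training_loss and pred_weight are hypothetical, and action_loss stands in for whatever objective trains the action head.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(action_loss, predicted_future, actual_future, pred_weight=1.0):
    """Action objective plus a supervised future-prediction term.

    predicted_future / actual_future: one state vector per future step.
    pred_weight=0.0 reproduces the "loss removed" ablation in this sketch.
    """
    per_step = [mse(p, a) for p, a in zip(predicted_future, actual_future)]
    prediction_loss = sum(per_step) / len(per_step)
    return action_loss + pred_weight * prediction_loss


predicted = [[1.0, 2.0], [0.0, 0.0]]  # two forecast steps, toy 2-D states
actual = [[1.0, 0.0], [0.0, 0.0]]     # what the body actually did

print(training_loss(0.5, predicted, actual))                   # 1.5
print(training_loss(0.5, predicted, actual, pred_weight=0.0))  # 0.5
```

With the weight at zero, nothing pushes the forecast toward reality, so the action head is conditioning on noise.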
The headline numbers, on a Unitree G1 with 29 joints, across nine terrains:
93.85% average traversal success. Up to 42.73% above strong MLP, MoE, and vanilla Transformer baselines.
The biggest margins show up on the hardest terrains.
One unified policy for stairs, gaps, slopes, rough ground, and obstacles. No per-terrain tuning.
The honest caveat:
this is still staged terrain, and it leans hard on the depth channel of its RGB-D input. Cut the depth feed and the policy fails completely.
So not grocery-run ready.
But "predict your next body state, then act" is the kind of shift that actually narrows the lab-to-street gap.