Most simulation benchmarks for VLAs cannot tell you whether their numbers map to reality. REALM can: its simulated results correlate with real-world rollouts (p < 0.001) across 7 manipulation skills and 5 perturbations.
The sim-to-real gap has been the central reason I have argued for collecting real data wherever possible. Most simulation benchmarks tell you something, but you cannot tell whether that something maps to reality. REALM, from Martin Sedlacek and the team at CTU Prague and Amsterdam, takes that problem seriously.
The team built a simulation environment designed to correlate with real-world performance, and then validated it. Pearson values close to identity on task progression curves. Attention maps from π0 show 0.85 cosine similarity between matched real and simulated frames. They did not skip the validation step. They led with it.
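For anyone who wants to run the same two checks on their own sim/real pairs, here is a minimal sketch. It is not REALM's code: the function names and the numbers are placeholders I made up for illustration, and it assumes you already have matched per-condition scores and flattened attention maps.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(a, b):
    """Cosine similarity between two flattened maps of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder data, NOT taken from the paper: task progression per condition,
# measured once in simulation and once on the real robot.
sim_progress = [0.90, 0.70, 0.50, 0.30, 0.20]
real_progress = [0.85, 0.65, 0.55, 0.25, 0.20]
print("Pearson r:", pearson_r(sim_progress, real_progress))

# Placeholder attention maps, flattened to 1-D, for one matched frame pair.
sim_attention = [0.1, 0.4, 0.3, 0.2]
real_attention = [0.2, 0.3, 0.3, 0.2]
print("Cosine similarity:", cosine_similarity(sim_attention, real_attention))
```

A high Pearson r says the sim ranks and scales conditions the way reality does. The cosine check is a separate question: whether the policy is looking at the same parts of the image in both.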
That changes what the simulation results actually mean. Across 15 perturbation factors covering visual, semantic, and behavioural variation, π0, π0-FAST, and GR00T N1.5 all show noticeable performance drops under semantic perturbations despite their internet-pretrained VLM backbones. All show sensitivity to camera viewpoint despite training on DROID's unusually diverse viewpoint distribution. The hardest axis of generalisation is across objects and their properties, not across skills. Reliability under perturbation is low across all three models.
If the sim correlates with reality at the level REALM demonstrates, these are not simulation artefacts. They are real failure modes that real teams should be planning around.
Two things this tells us.
Validated simulation has a role in evaluation that it does not yet have in training. The cost of running thousands of perturbed rollouts in the real world is prohibitive. If REALM's correlation holds up across more task families, sim-based evaluation could become a serious tool for surfacing failure modes that ad-hoc real-world testing misses.
The failure pattern across all three tested models also points back at the same place it always does. Pretraining buys you semantic grounding and skill primitives. It does not buy you robustness. The next generation of training data needs to focus on demonstrations where the object, scene, and viewpoint move underneath the skill, not on more demonstrations of the same skill on the same object.
Paper link in comments.