I keep getting asked: why are we still evaluating VLAs on LIBERO?
My take: it's adoption. So many papers report LIBERO results that new work follows suit to stay comparable. Yes, 98.2% vs 98.7% is hard to distinguish, but it still tells us something about policy efficiency, at least for quick fine-tuning on that domain and those embodiments.
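To put a rough number on "hard to distinguish", here is a minimal sketch that compares Wilson score intervals for the two success rates. The episode count (2000 per policy) is my assumption for illustration, not something taken from a specific paper, and the sketch treats every episode as an independent trial, which real rollouts only approximate. The helper names are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: 2000 evaluation episodes per policy (hypothetical count).
EPISODES = 2000
policy_a = wilson_interval(1964, EPISODES)  # 98.2% success
policy_b = wilson_interval(1974, EPISODES)  # 98.7% success

print(f"policy A: {policy_a[0]:.4f} to {policy_a[1]:.4f}")
print(f"policy B: {policy_b[0]:.4f} to {policy_b[1]:.4f}")
print("intervals overlap:", intervals_overlap(policy_a, policy_b))
```

Under these assumptions the two intervals overlap, so the half-point gap alone doesn't separate the policies. Overlapping intervals are only a rough heuristic, not a formal significance test, and fewer episodes would widen both intervals further.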
Better benchmarks have emerged since. MolmoSpace, for instance, evaluates more than just VLAs. And even though MolmoAct2 currently tops it among VLAs, there's still plenty of headroom for other approaches like TipTOP (Open-world FM TAMP) to push past it. The benchmark is nowhere near saturated, which makes it great for evaluating at scale in simulation.
Robotics evaluation is far from solved, and no evaluation is perfect yet. But that's exactly what makes the field exciting: so many open questions are still on the table.