Robot action models shouldn't need 256 vision tokens per frame.
Pi0.5 spends 400M parameters on SigLIP just to see. We replaced it with a 4.4M-parameter encoder that
outputs 5 tokens per frame, and action quality barely changes.
91x smaller. 51x fewer tokens. 7.3x faster inference.
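
The size and token ratios follow directly from the figures quoted above. A minimal sketch of that arithmetic is below; the variable names are illustrative only, and the 7.3x inference speedup is a measured number that this calculation does not derive.

```python
# Figures taken from the text above; variable names are hypothetical.
siglip_params = 400e6    # SigLIP vision encoder parameters
compact_params = 4.4e6   # replacement encoder parameters
siglip_tokens = 256      # vision tokens per frame with SigLIP
compact_tokens = 5       # vision tokens per frame with the replacement

param_ratio = siglip_params / compact_params
token_ratio = siglip_tokens / compact_tokens

print(f"parameters: {param_ratio:.1f}x smaller")  # 90.9x, quoted as 91x
print(f"tokens:     {token_ratio:.1f}x fewer")    # 51.2x, quoted as 51x
```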