Robotics RL finetuning recipe:
Pretrain → BC/VLA → Online RL Finetuning → Continuous Improvement.
How can pretrained robot policies be improved efficiently with minimal online interaction?
• EXPO-FT
RL finetuning can make VLA models practically useful on real robots.
Instead of training from scratch, EXPO-FT starts from pretrained VLA models and performs stable RL finetuning, achieving 30/30 success on challenging manipulation tasks (string-light routing, pool-ball striking, flower-in-bottle insertion) using only ~19.1 minutes of online robot data on average. This directly targets the reliability gap between VLA generalization and deployment-level robustness, via residual action editing and Q-guided action selection.
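A rough sketch of how residual action editing and Q-guided selection could fit together, assuming a sample-edit-select loop: the pretrained policy proposes actions, a small edit policy nudges each one by a bounded residual, and a critic picks the best candidate. Every name here (q_guided_select, toy_base_policy, toy_edit_policy, toy_q, edit_scale) is hypothetical and chosen for illustration; this is not the EXPO-FT implementation.

```python
import random

def q_guided_select(state, base_policy, edit_policy, q_fn,
                    n_samples=4, edit_scale=0.1):
    """Return the highest-Q action among base samples and their edited versions."""
    candidates = []
    for _ in range(n_samples):
        base_action = base_policy(state)
        raw_edit = edit_policy(state, base_action)
        # Clip each edit component to [-1, 1] and scale it, so the edited
        # action stays close to the pretrained proposal (assumed design).
        edited = [a + edit_scale * max(-1.0, min(1.0, d))
                  for a, d in zip(base_action, raw_edit)]
        candidates.append(base_action)
        candidates.append(edited)
    # Q-guided selection: keep whichever candidate the critic scores highest.
    return max(candidates, key=lambda action: q_fn(state, action))

# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)
TARGET = [0.5, -0.2]

def toy_base_policy(state):
    # Pretend pretrained policy: noisy and slightly off target.
    return [rng.gauss(0.3, 0.2), rng.gauss(0.0, 0.2)]

def toy_edit_policy(state, action):
    # Pretend learned edit: points from the proposal toward the target.
    return [t - a for t, a in zip(TARGET, action)]

def toy_q(state, action):
    # Pretend critic: higher when the action is closer to the target.
    return -sum((a - t) ** 2 for a, t in zip(action, TARGET))

chosen = q_guided_select([0.0], toy_base_policy, toy_edit_policy, toy_q)
print("chosen action:", chosen, "Q:", toy_q([0.0], chosen))
```

Because the unedited base actions stay in the candidate set, the selected action is never worse than the pretrained proposals under the critic's own estimate, which is one plausible reason this kind of scheme finetunes stably.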
• Q2RL (When Life Gives You BC, Make Q-functions)
Behavior Cloning already contains implicit value information.
Instead of discarding BC and starting RL from scratch, Q2RL uses a small amount of online interaction to estimate a Q-function around a pretrained BC policy, then performs online improvement with Q-Gating, dynamically selecting between BC actions and RL actions according to estimated values.
Results show up to 3.75× improvement over the original BC policy and successful on-robot learning for contact-rich assembly tasks within only 1–2 hours.
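The gating idea can be sketched in a few lines, assuming a critic q_fn(state, action) is already available. The function name q_gate, the margin parameter, and the toy critic are assumptions made for illustration, not details taken from the paper.

```python
def q_gate(state, bc_action, rl_action, q_fn, margin=0.0):
    """Choose between the BC action and the RL action by estimated value.

    The RL action is used only when its Q-estimate beats the BC action's
    by more than `margin` (a hypothetical safety knob); otherwise the
    policy falls back to BC.
    """
    q_bc = q_fn(state, bc_action)
    q_rl = q_fn(state, rl_action)
    if q_rl > q_bc + margin:
        return rl_action, "rl"
    return bc_action, "bc"

def toy_q(state, action):
    # Hypothetical critic: prefers actions numerically close to the state.
    return -abs(action - state)

print(q_gate(1.0, bc_action=0.5, rl_action=0.9, q_fn=toy_q))
print(q_gate(1.0, bc_action=0.5, rl_action=0.1, q_fn=toy_q))
```

Defaulting to BC on ties or near-ties keeps early online behavior close to the pretrained policy while the critic is still inaccurate.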
• Flow Reversal Steering (FRS)
Rather than updating policy weights, FRS exploits the latent structure already present inside flow-matching robotic generalists.
Given a "reasonable but suboptimal" action, FRS reverses the flow process to recover latent noise and then steers that noise toward nearby policy-consistent action modes.
This effectively turns semantic guidance from humans or VLMs into improved robot actions.
Even more interesting: the resulting improvements can be distilled into a lightweight policy with BC, producing success-rate gains in under one minute of training. FRS also enables RL to bootstrap from semantic priors when conventional RL fails.
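A toy illustration of the reverse-then-steer pipeline, assuming the policy is an ODE flow from noise at t=0 to an action at t=1. The velocity field, the Euler integrator, and especially the steering rule (clamping the recovered noise to a typical range of the prior) are stand-ins invented for this sketch; the paper's actual steering mechanism may differ.

```python
def integrate(velocity, x, t_start, t_end, n_steps=200):
    """Euler-integrate dx/dt = velocity(x, t) from t_start to t_end.

    Passing t_start > t_end runs the flow backward.
    """
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x

def reverse_and_steer(velocity, action, steer_fn, n_steps=200):
    """Map an action back to latent noise, steer the noise, regenerate."""
    noise = integrate(velocity, action, 1.0, 0.0, n_steps)    # action -> noise
    steered = steer_fn(noise)                                 # edit in latent space
    new_action = integrate(velocity, steered, 0.0, 1.0, n_steps)  # noise -> action
    return new_action, noise, steered

# Toy flow (assumption): every sample is pulled toward a single action mode.
MODE = [1.0, -0.5]

def toy_velocity(x, t):
    return [m - xi for m, xi in zip(MODE, x)]

def clamp_noise(z, limit=2.0):
    # Hypothetical steering rule: keep the latent inside a typical prior range.
    return [max(-limit, min(limit, zi)) for zi in z]

suboptimal = [2.5, -0.5]   # "reasonable but suboptimal" proposal
improved, noise, steered = reverse_and_steer(toy_velocity, suboptimal, clamp_noise)
print("recovered noise:", noise)
print("improved action:", improved)
```

In this toy setup the off-mode component of the proposal maps to an atypically large latent, and pulling that latent back into range yields an action closer to the mode, all without touching the flow's weights.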
• OGPO
Don't freeze the generative policy. Fully finetune it.
OGPO (Off-policy Generative Policy Optimization) combines off-policy critics for aggressive data reuse, a modified PPO objective for full generative-policy finetuning, and backpropagation through the entire diffusion/flow process.
The paper also identifies several practical stabilizers: success-buffer regularization, conservative advantages, and Q-variance reduction.
Notably, OGPO can recover poorly initialized BC policies to near-complete success even without expert demonstrations in the online replay buffer.
• RECAP (π*0.6)
Advantage-conditioned policy learning that lets VLAs learn from successes, failures, and human corrections at scale. The policy is then specialized with on-robot data, which reportedly doubles throughput and halves failure rates on espresso making, laundry folding, and box assembly.
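One way to picture advantage conditioning is as a data-labeling step: each logged transition gets a "positive" or "negative" tag, the policy is trained with that tag as an extra input, and at deployment it is prompted with "positive". The sketch below makes several assumptions of its own (undiscounted returns, a fixed threshold, human corrections always tagged positive, and all field and function names), so treat it as an illustration rather than the RECAP recipe.

```python
def label_for_advantage_conditioning(episodes, value_fn, threshold=0.0):
    """Attach a 'positive'/'negative' condition to every logged step."""
    examples = []
    for episode in episodes:
        # Undiscounted return-to-go for each step (assumed for simplicity).
        returns, running = [], 0.0
        for step in reversed(episode):
            running += step["reward"]
            returns.append(running)
        returns.reverse()

        for step, ret in zip(episode, returns):
            advantage = ret - value_fn(step["obs"])
            # Human corrections are treated as good regardless of advantage
            # (an assumption of this sketch).
            is_good = step.get("human_correction", False) or advantage > threshold
            examples.append({
                "obs": step["obs"],
                "action": step["action"],
                "condition": "positive" if is_good else "negative",
            })
    return examples

# Hypothetical logs: one success, one failure containing a human correction.
episodes = [
    [{"obs": 0, "action": "a", "reward": 0.0},
     {"obs": 1, "action": "b", "reward": 1.0}],
    [{"obs": 0, "action": "c", "reward": 0.0, "human_correction": True},
     {"obs": 1, "action": "d", "reward": -1.0}],
]
examples = label_for_advantage_conditioning(episodes, value_fn=lambda obs: 0.5)
for ex in examples:
    print(ex)
```

Failures are kept rather than filtered out: the policy still learns from them, but under the "negative" tag, so they inform the model without being imitated at deployment.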
• RLT (RL Token)
Adds a compact learned token interface to the frozen VLA so a lightweight actor-critic head can do fast online adaptation on precise sub-skills in minutes to hours.
• DICE-RL
Treats RL finetuning as distribution contraction: starting from a broad generative prior, it selectively amplifies successful behaviors through residual RL and value-guided updates, improving robustness and stability on challenging real-world manipulation tasks.
Summary:
Robotics is converging toward the same paradigm shift that happened in LLMs.
Pretraining provides broad priors; post-training becomes the bottleneck. Recurring themes:
• Sample-efficient online RL
• Value extraction from pretrained policies
• Steering latent action manifolds instead of learning from scratch
• Full-policy finetuning for diffusion/flow controllers
• Closing the pretraining–posttraining gap
RECAP (π*0.6), RLT, DICE-RL, OGPO, EXPO-FT, Q2RL, and FRS are all pieces of the same puzzle:
How can robots continuously self-improve after deployment without requiring another massive pretraining run?
The next generation of robotics systems may be defined less by bigger world models and more by better post-training loops: how do we improve a policy after deployment with only minutes of interaction?