If you thought MIT wasn't going to drop another banger RL method this week, you were wrong.
They just released Vector Policy Optimization.
Most RL methods train a model to generate a single answer. VPO instead trains it to generate a set of candidates for search.
The model is trained to produce a batch of candidates, separated by a delimiter, in a single rollout. This lets the model reason across answers within that rollout and create diversity within the candidate set.
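To make the rollout format concrete, here is a minimal sketch of how a search harness might split one rollout back into candidates. The delimiter string `<|candidate|>` and the helper name `split_candidates` are made up for illustration; the post does not say which delimiter VPO actually uses.

```python
# Hypothetical delimiter: an assumption for illustration, not the one VPO uses.
DELIMITER = "<|candidate|>"


def split_candidates(rollout: str, delimiter: str = DELIMITER) -> list[str]:
    """Split a single rollout into its candidates, dropping empty pieces."""
    parts = [piece.strip() for piece in rollout.split(delimiter)]
    return [piece for piece in parts if piece]


# One rollout carrying three kernel ideas. Later candidates are generated
# with the earlier ones already in context, so they can avoid repeating them.
rollout = (
    "tile 16x16, stage through shared memory"
    " <|candidate|> tile 32x8, registers only"
    " <|candidate|> split-K with atomic accumulation"
)
candidates = split_candidates(rollout)
print(len(candidates))
```

Each string in `candidates` would then be benchmarked or verified on its own by the search loop.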
Think about kernel optimization.
If you ask a model for 16 possible CUDA kernels, you want real variation. Some candidates should try different tiling choices. Some should use memory differently. Some should make different tradeoffs around speed, stability, or which tensor shapes they handle well.
You want to explore the full design space instead of exploring only around a local maximum. Generating one candidate per rollout and depending on stochastic variation often leads to similar results across rollouts.
VPO argues that the RL phase should explore the search space by producing candidates with different strengths. The test-time search loop can then exploit the promising regions with benchmarks, verifiers, or evolutionary search.
VPO does this with two changes.
First, the model generates multiple candidates in one rollout, so later candidates can see earlier ones and avoid repeating the same idea.
Second, it uses reward vectors instead of a single scalar reward.
For code, the reward vector might include tests passed, speed on different input sizes, memory use, numerical stability, and hardware compatibility.
Instead of picking one fixed weighting of those signals, VPO samples many weightings. One weighting may care mostly about speed. Another may care more about memory. Another may favor stability or an edge case.
For each weighting, VPO asks which candidate in the set wins.
The set gets rewarded when it contains winners across many different weightings, which the authors call reward-space diversity. The model is trained to keep multiple useful directions alive, so search has something to work with later.
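Here is a rough sketch of the "who wins under each weighting" idea. Everything specific in it is my assumption, not the paper's recipe: weightings are drawn uniformly from the simplex, a candidate's score is a plain weighted sum, and the set-level score is the fraction of candidates that win at least once. The function names are hypothetical too.

```python
import random


def sample_weighting(dim: int, rng: random.Random) -> list[float]:
    """Draw non-negative weights that sum to 1 (uniform on the simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def winners(reward_vectors: list[list[float]],
            num_weightings: int = 1000, seed: int = 0) -> set[int]:
    """Indices of candidates that win under at least one sampled weighting."""
    rng = random.Random(seed)
    dim = len(reward_vectors[0])
    won = set()
    for _ in range(num_weightings):
        w = sample_weighting(dim, rng)
        scores = [sum(wi * ri for wi, ri in zip(w, r)) for r in reward_vectors]
        won.add(max(range(len(scores)), key=scores.__getitem__))
    return won


def diversity_score(reward_vectors: list[list[float]]) -> float:
    """Assumed set-level score: share of candidates that win somewhere."""
    return len(winners(reward_vectors)) / len(reward_vectors)


# Reward vectors are (speed, memory, stability), each scaled to [0, 1].
diverse = [[0.9, 0.2, 0.3], [0.2, 0.9, 0.3], [0.3, 0.2, 0.9]]
collapsed = [[0.9, 0.2, 0.3], [0.88, 0.19, 0.29], [0.85, 0.18, 0.28]]

print(winners(diverse), diversity_score(diverse))
print(winners(collapsed), diversity_score(collapsed))
```

In the diverse set every candidate is the best pick for some weighting. In the collapsed set the first candidate beats the others on every axis, so it is the only winner and the set scores low.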
The ablations are important here too.
Multi-candidate rollout by itself is not enough. If you still train the set with one scalar reward, the candidates can collapse into similar answers.
Random reward weightings by themselves are not enough either. If the model still emits one answer at a time, you are changing the target but not training a useful candidate set.
VPO needs both: generate a set, then reward the set for covering different parts of the reward space.
That also explains why it helps less when rewards are collinear.
If one kernel is best on correctness, speed, memory, and stability, every weighting picks the same winner. Diversity does not buy much.
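You can check for this case directly. A small sketch, assuming non-negative weightings and higher-is-better rewards: if one candidate is at least as good as every other candidate on every reward component, no weighting can make it lose. The helper names are hypothetical.

```python
from typing import Optional


def dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good as b on every reward component."""
    return all(x >= y for x, y in zip(a, b))


def universal_winner(reward_vectors: list[list[float]]) -> Optional[int]:
    """Index of a candidate that wins (or ties) under every non-negative
    weighting, or None if the set contains real tradeoffs."""
    for i, cand in enumerate(reward_vectors):
        others = (r for j, r in enumerate(reward_vectors) if j != i)
        if all(dominates(cand, other) for other in others):
            return i
    return None


# Components: (correctness, speed, memory, stability), higher is better.
aligned = [[0.9, 0.8, 0.9, 0.7], [0.6, 0.5, 0.6, 0.4], [0.3, 0.2, 0.3, 0.1]]
tradeoff = [[1.0, 0.9, 0.2, 0.4], [1.0, 0.3, 0.9, 0.8]]

print(universal_winner(aligned))
print(universal_winner(tradeoff))
```

When `universal_winner` returns an index, sampling more weightings cannot surface a second winner. When it returns `None`, the set has at least one tradeoff for the weightings to pull apart.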
But hard search problems usually have tradeoffs. The fastest kernel may be brittle. The robust one may be slower. The version that works best on small inputs may lose on large inputs. The weird candidate may be bad now but contain the mutation path that wins later.
Search only works if the generator gives it somewhere to search.
There are many interesting areas where you could apply this, such as model training, agent planning, auto-research, and code search. VPO is promising for any problem where the design space is large and full of competing tradeoffs.
Paper in the comments