Tired of tuning PPO or blaming it on reward, task design, etc.? Introducing EPO -- our second (and hopefully final :) attempt at fixing PPO at scale!
Contrary to intuition, as the batch size or amount of data increases, PPO saturates due to a lack of diversity in sampling. We proposed a solution in SAPG (sapg-rl.github.io/) by incorporating an ensemble at training time (not test time), but the variance was high. Our latest paper, EPO, fixes this issue with interesting insights! Try it out.
(1/n) Since its publication in 2017, PPO has essentially become synonymous with RL. Today, we are excited to provide you with a better alternative - EPO.