As noted in DeepSeek-R1 and other studies, RL fine-tuning has several limitations, including challenges with long-horizon and outcome-only rewards, low sample efficiency, high-variance credit assignment, instability, and reward hacking.
ES sidesteps these issues by construction: it perturbs parameters rather than actions, scores each perturbed model on full rollouts, and averages the update over a population. The resulting optimization is gradient-free, stable, resistant to reward hacking, and easy to parallelize.
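
To make the three ingredients concrete (parameter-space perturbation, one scalar score per full rollout, population averaging), here is a minimal sketch of a single ES update on a toy problem. It is an illustration under stated assumptions, not the implementation used in the work discussed here: `es_step`, `toy_reward`, `TARGET`, and all hyperparameter values are hypothetical, the quadratic reward merely stands in for an outcome-only rollout score, and the reward standardization is a common convention that the text above does not specify.

```python
import numpy as np

# Hypothetical optimum of the toy reward; stands in for "good" model parameters.
TARGET = np.array([1.0, -2.0, 0.5])


def toy_reward(params):
    """Assumed stand-in for an outcome-only reward: one scalar per full rollout."""
    return -np.sum((params - TARGET) ** 2)


def es_step(theta, reward_fn, rng, pop_size=50, sigma=0.1, lr=0.05):
    """One ES update: perturb parameters, score each copy, average over the population."""
    # Sample one Gaussian perturbation per population member (parameter space, not action space).
    eps = rng.standard_normal((pop_size, theta.size))
    # Each perturbed parameter vector is scored once; no per-step credit assignment is needed.
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Standardize rewards so the update does not depend on the reward scale (assumed convention).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of the perturbations gives the search direction.
    # The usual 1/sigma factor is folded into lr here.
    direction = eps.T @ adv / pop_size
    return theta + lr * direction


def run(steps=300, seed=0):
    """Run ES from a zero initialization and return the final parameters."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(TARGET.size)
    for _ in range(steps):
        theta = es_step(theta, toy_reward, rng)
    return theta
```

Calling `run()` should move the parameters from the zero vector toward `TARGET` without ever computing a gradient of the reward. Because every `reward_fn` call inside `es_step` is independent of the others, the population loop is the natural place to distribute work across processes or machines.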