2/ A core issue with parameter-only RL is that it forces task-specific learning into the model weights. Traditional RL can improve performance on the current task, but it also tends to shift behavior away from the base model, increasing forgetting and reducing plasticity. Prompt optimization alone has the opposite limitation: it is fast and cheap, but usually not enough to match the gains from weight updates.
The paper introduces Fast-Slow Training (FST). FST splits adaptation into two co-evolving channels:
Slow weights (θ): the model parameters, updated by RL
Fast weights (Φ): a population of prompts, evolved by GEPA
In FST, context is updated from rich textual feedback, while RL updates the model more gradually. Each round interleaves a GEPA reflection cycle (a reflection model rewrites prompts from failure traces) with a few RL steps sampled across that prompt population. Both channels optimize the same reward, concurrently. No parameter freeze, no sequential hand-off.
This lets task-specific lessons move quickly through the fast channel, while preserving more of the base model’s general behavior in the slow channel.
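To make the interleaving concrete, here is a toy sketch of the round structure, not the paper's implementation. Everything in it is an assumption for illustration: θ is a single float, each "prompt" in Φ is a float, the reward is a made-up quadratic, `reflect` is a numeric stand-in for a GEPA reflection cycle, and `rl_steps` is plain gradient ascent standing in for RL. The names `TARGET`, `reward`, `reflect`, `rl_steps`, and `fast_slow_train` are all hypothetical.

```python
import random

TARGET = 3.0  # hypothetical task optimum for the toy reward


def reward(theta, prompt):
    """Toy shared reward: highest (zero) when theta + prompt hits TARGET."""
    return -(theta + prompt - TARGET) ** 2


def reflect(theta, prompts):
    """Stand-in for a GEPA reflection cycle (assumption, not real GEPA):
    drop the worst prompt and add a revision of the best one, guided by
    the best prompt's signed error (the toy analogue of textual feedback)."""
    ranked = sorted(prompts, key=lambda p: reward(theta, p), reverse=True)
    best = ranked[0]
    error = TARGET - (theta + best)
    return ranked[:-1] + [best + 0.5 * error]


def rl_steps(theta, prompts, rng, n_steps=2, lr=0.01):
    """Stand-in for RL: a few small gradient-ascent steps on theta,
    each using a prompt sampled from the current population."""
    for _ in range(n_steps):
        p = rng.choice(prompts)
        grad = -2.0 * (theta + p - TARGET)  # d(reward)/d(theta)
        theta += lr * grad
    return theta


def fast_slow_train(rounds=20, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # slow weights start at the "base model"
    prompts = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # fast weights
    for _ in range(rounds):
        prompts = reflect(theta, prompts)      # fast channel: rewrite prompts
        theta = rl_steps(theta, prompts, rng)  # slow channel: few RL steps
    return theta, prompts


if __name__ == "__main__":
    theta, prompts = fast_slow_train()
    best = max(reward(theta, p) for p in prompts)
    print(f"theta={theta:.3f}  best reward={best:.6f}")
```

Both functions score against the same `reward`, and neither channel is frozen while the other runs. Because the step size on θ is kept small in this toy, most of the adaptation lands in the prompt population while θ stays near its starting value, which is the intuition behind the fast/slow split.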