a nice, straightforward summary of some GRPO limitations beyond just not being great for multi-turn, e.g.:
* if you have multiple reward signals -> the model won't know which one it's being rewarded for, since they're usually all collapsed into a single scalar
* only the scalar reward is used for the policy update, when more detailed textual feedback could be used (roughly what GEPA does with its reflective prompt evolution)
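
to make the first point concrete, here's a minimal sketch (not taken from the blogpost) of how two reward signals get collapsed before the group-relative advantage is computed. the reward names, the weights, and the use of the population standard deviation are all assumptions for illustration; real GRPO implementations differ in the details.

```python
from statistics import mean, pstdev


def collapse_rewards(components, weights):
    """Weighted sum of per-signal rewards into a single scalar."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each scalar reward against its sampling group (mean/std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward signals and weights, chosen only for this example.
weights = {"correctness": 1.0, "format": 1.0}

# Three completions sampled for the same prompt.
completion_a = {"correctness": 1.0, "format": 0.0}  # right answer, bad format
completion_b = {"correctness": 0.0, "format": 1.0}  # wrong answer, good format
completion_c = {"correctness": 0.0, "format": 0.0}  # wrong answer, bad format

group = [completion_a, completion_b, completion_c]
scalars = [collapse_rewards(c, weights) for c in group]
advantages = group_relative_advantages(scalars)

for components, scalar, adv in zip(group, scalars, advantages):
    print(components, "-> scalar:", scalar, "advantage:", round(adv, 3))
```

completions a and b end up with the same scalar and therefore the same advantage, so the policy update can't tell "got the answer right" apart from "got the format right".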
wrote a short blogpost on what I think are some limitations of GRPO:
I’ve been playing around with RL fine-tuning for reasoning tasks and came across a few limitations that I wanted to document here.
feedback/corrections are welcome!