The most relevant work I could find uses pretty old models:
1. Self-Refine (Madaan et al., 2023) uses GPT-4. Quality improves on simple problems over the first few iterations, then plateaus.
2. Huang et al. (ICLR 2024), "Large Language Models Cannot Self-Correct Reasoning Yet", uses GPT-4 Turbo. GSM8K accuracy gets *worse* with self-correction.
3. Telephone game (Perez et al., ICLR 2025) uses GPT-4o-mini. But this is just *repeating* content, not optimizing anything.