RLM arXiv paper update: depth>1 results, more comparisons, more training, and more error analysis!
We add depth=2 and depth=3 experiments, where the RLM can itself make recursive RLM calls. This is also supported in the open-source `rlm` repo. We observe significant performance gains on OOLONG-Pairs and gains on all other benchmarks!
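To make the role of the depth setting concrete, here is a minimal toy sketch of depth-limited recursion. It is not the `rlm` repo's API and not the paper's method: the names `recursive_answer`, `lm`, `max_chars`, and `fanout` are hypothetical, and the fixed split stands in for the decomposition that an actual RLM chooses for itself.

```python
from typing import Callable

# Hypothetical stand-in for a model call: (query, text) -> answer string.
LM = Callable[[str, str], str]


def recursive_answer(
    query: str,
    context: str,
    lm: LM,
    depth: int,
    max_chars: int = 1000,
    fanout: int = 2,
) -> str:
    """Toy depth-limited recursion (illustrative only, not the `rlm` API)."""
    # Base case: no recursion budget left, or the context already fits in one call.
    if depth <= 0 or len(context) <= max_chars:
        return lm(query, context)

    # Assumption: split into `fanout` roughly equal parts (ceiling division).
    step = -(-len(context) // fanout)
    parts = [context[i:i + step] for i in range(0, len(context), step)]

    # Each part is handled by a sub-call with one less level of depth.
    partials = [
        recursive_answer(query, part, lm, depth - 1, max_chars, fanout)
        for part in parts
    ]

    # Combine the partial answers with one more model call.
    return lm(query, "\n".join(partials))
```

In this toy version, a larger `depth` lets sub-calls split their own input again, so each individual model call sees a smaller piece of the context.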
By popular request, we now also include several comparisons against OpenCode and Claude Code.
We add a length generalization experiment on MRCRv2 that shows further promising training results, add a small prompting case study on OOLONG, and update the error analysis section to cover the effect of syntax errors, decomposition mistakes, and general observations from the RLM trajectories.
The appendix has also been updated with several new experiments and plots!