Thanks for your kind thoughts and support, Jeremy! I do believe this direction could potentially open up a new level of scaling toward deeper models with stronger reasoning capabilities.
To be clear, and to my embarrassment, I was not aware of the Depth-μP work when developing this paper; the idea grew out of my years of work on model compression and layerwise analysis:
1. arxiv.org/abs/2202.02643
2. arxiv.org/abs/2310.05175
3. arxiv.org/abs/2410.10912
After releasing our paper, we compared our method with Depth-μP and found that the two perform similarly. I fully respect prior work, and we will expand the related work discussion and add the appropriate references in the next version.