We have benchmarks for coding, math, and general reasoning, but few test how well a model can conduct research autonomously in a realistic setting.
Quant research is a perfect stress test: it is noisy, statistical, and code-heavy, and it is easy to fool yourself.
We recently put this to the test.