We put out a blog version of AutoLab in March. Today, I am excited to share that our full paper is out!

📖 AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Since then, it's been great to see work in the field building on what we shared. That's a big part of why we went back and did the complete version, with the full experiments.

The question is simple: can a model stay with a hard problem for hours?

36 environments, each a real program that works but is not optimized. The model gets the code, a sandbox, up to 12 hours, and a sealed scorer: the only way to a better number is better code.

We ran 17 frontier models. 2,544 hours, 8.6 billion tokens.

The finding from the blog held up: the strong models weren't the ones with the best first attempt. They were the ones that kept closing the loop: test, change, test again!

Persistence alone wasn't enough. Some models ground on for hours but barely ran the code, and the clock ran out on them. Others gave up with hours left.

We built this benchmark, and it might not capture everything. But we hope it fills in a piece of the picture that one-shot scores miss, especially for anyone building agents to do hours of real work!

If you've been building on the blog version, we'd love to see what you found. And if you try your models on it, tell us what breaks.

#AutoLabBench