[Accidentally deleted this earlier, reposting] 😭
#AutoLab #autoresearch
We've been asking ourselves a question: if AI agents can now run hundreds of experiments overnight, how do we know whether they're actually contributing to research — or just generating noise?
That's why we built AutoLab (autolab.moe/blog).
Not another pass/fail benchmark, but an open-source environment where agents face the same loop every researcher knows intimately — propose, test, fail, diagnose, revise, repeat.
23 tasks with no answer keys, just open search spaces and real constraints.
We ran 161 evaluations across 7 frontier models, totaling 633M tokens. Every decision, every pivot, every dead end is openly available in our Live Lab for anyone to replay and learn from.
What we found wasn't about which model is "smartest." It's about a capability we call closed-loop resilience: when incremental refinement stops working, can the agent recognize it and restructure? On one task, two frontier models hit the same wall. One kept pushing within the existing frame. The other stepped back and redesigned the approach entirely. That moment — knowing when to abandon a frame, not just optimize within it — is what separates real research from sophisticated pattern matching.
We believe this matters beyond benchmarking. If agents are genuinely entering the research loop, we want that transition to be measured transparently, built in the open, and shaped by the community — not locked inside any single lab. The scientist doesn't disappear. The loop gets a new participant. And we want to make sure that participant is understood.
This is a joint effort across @Stanford, @MIT, @UW, @UCSanDiego, @ucsantabarbara, @NotreDame, NUS, @Google, @NVIDIA, @IBMResearch, and @bakelab_hq.
But 23 tasks is just the start. If you have an optimization problem you've spent weeks grinding on empirically, one with a clear metric and no known optimal solution, it probably belongs here. Contribute a full task, a rough skeleton, or just the idea.
The best benchmarks aren't built by one team. They're built by the people who actually do the work!
GitHub: github.com/autolabhq/autolab