"A benchmark is only as good as its verifier."
To me, the verifier audit is the coolest part of the release!
I hope everyone includes similar sections when presenting new benchmarks in the future.
And I will hopefully have something similar (or more) to share soon 🤞 (maybe even tomorrow?)
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting what developers experience in their day-to-day work.