Interesting new SWE/agentic benchmark (DeepSWE) was released yesterday. 113 tasks across 91 repos in 5 languages. Here are interesting things I noticed:
- The evaluation harness (mini-swe-agent) gives every model a single bash tool and the same SI. No vendor editing primitives.
- Eval Prompts are shorter than SWE-Bench Pro, but require 5.5Γ more code and touch 7 files on average. The idea is to mimic how developers actually talk to agents, short behavioral descriptions, not verbose specs.
- SI describes a specific workflow: find code, reproduce, fix, verify, edge cases, submit. This maps directly onto how the verifier grades, which could bias toward models that follow instructions literally over models that explore more.
- The bash tool is guarded, outputs over 10k chars get truncated. Malformed tool calls get caught and retried with guidance rather than crashing. To prevent to blow up context.
- Mini-swe-agent claims to match or beat 1P harnesses on the same tasks. Claude Opus scored 10pp over Claude Code. Gemini 3.1 Pro scored 20pp over Gemini CLI.
Would love to see how other harness Γ model combinations will do, e.g.
@cursor_ai,
@antigravity,
@FactoryAI and how well the eval harness does on more general knowledge work, e.g. GDPval.
Great to see the SWE-agent team keep pushing on both the research and eval side. π€