DeepSWE and DeepSeek V4 Pro: the benchmark is surely designed to favor GPT 5.5, and mini-swe is not ideal for many of the models tested, but I ran it myself against the official DeepSeek API, using V4 Pro with Max reasoning. It used around 1B tokens and I got only 5.3% in the end 😢
Estimated cost:
- cache-hit input: $3.54
- cache-miss input: $3.74
- output: $4.89
- total: $12.18
Without cache-aware pricing, the same token volume would look like about $433.76 😱, so cache accounting is essential here.
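For anyone who wants to redo the accounting, here is a minimal sketch of a cache-aware cost calculation. The function name, the per-million-token price parameters, and the example numbers are placeholders of mine, not DeepSeek's actual prices or this run's token counts. It also assumes that "without cache-aware pricing" means billing every input token at the cache-miss rate.

```python
def cache_aware_cost(hit_tokens, miss_tokens, output_tokens,
                     hit_price, miss_price, output_price):
    """Return (cache_aware_usd, naive_usd). Prices are USD per 1M tokens."""
    per_million = 1_000_000
    input_hit = hit_tokens * hit_price / per_million
    input_miss = miss_tokens * miss_price / per_million
    output = output_tokens * output_price / per_million
    aware = input_hit + input_miss + output
    # Assumption: the naive estimate bills all input at the cache-miss rate.
    naive = (hit_tokens + miss_tokens) * miss_price / per_million + output
    return aware, naive


# Placeholder values only: replace with real usage logs and real prices.
aware, naive = cache_aware_cost(
    hit_tokens=900_000_000, miss_tokens=50_000_000, output_tokens=10_000_000,
    hit_price=0.01, miss_price=1.0, output_price=2.0,
)
print(f"cache-aware: ${aware:.2f}, naive: ${naive:.2f}")
```

When almost all input tokens are cache hits, as in a long agent loop that resends the same context, the gap between the two estimates is driven almost entirely by the ratio of the miss price to the hit price.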
Bottom line from the AI analysis:
This is a clean direct-DeepSeek run from an infrastructure/methodology standpoint: no OpenRouter ambiguity, no Docker setup failures, no missing thinking metadata, and no retries. The result is low: 6/113 = 5.31%, with 3 agent timeouts counted as failures.
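The score arithmetic is easy to double-check. This is only a sketch using the counts quoted above; the helper name is mine. It also shows that the timeouts do not explain the low score: dropping them from the denominator barely moves the number.

```python
def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100.0 * passed / total


passed, total, timeouts = 6, 113, 3

# Timeouts counted as failures (the reported figure).
print(f"{pass_rate(passed, total):.2f}%")             # 5.31%

# Timeouts excluded from the denominator instead.
print(f"{pass_rate(passed, total - timeouts):.2f}%")  # 5.45%
```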
I invite others to check what's going on here in detail. The results seem really odd, but the code is there, the verifiers are there, all in the open. I'm retrying now with reasoning High, and later I'll try a different harness.
Composer 2.5 with Grok Build gave me ~10%
We need to keep investigating. I'm testing MiniMax M3 now.