Local agentic 'Tool-Call Benchmark' comparing DeepSeek-v4-Flash and Step-3.7-Flash.
Same host, same 69 scenarios, two models.
Results:
DeepSeek-v4-Flash:
90/100 quality, 59 passed, 6 partial, 4 failed
Step-3.7-Flash:
87/100 quality, 55 passed, 10 partial, 4 failed
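
For a quick sanity check of the counts, here is a minimal Python sketch that confirms each model's outcomes add up to the 69 scenarios and converts them to percentages. The dict layout and the `pass_rates` helper are hypothetical, not part of the benchmark tool. The post does not say how the quality score out of 100 is computed, so that score is not reproduced here.

```python
# Counts copied from the results above; the data layout is illustrative only.
results = {
    "DeepSeek-v4-Flash": {"passed": 59, "partial": 6, "failed": 4},
    "Step-3.7-Flash": {"passed": 55, "partial": 10, "failed": 4},
}
TOTAL_SCENARIOS = 69


def pass_rates(counts, total):
    """Return the share of scenarios in each outcome bucket, as percentages."""
    # Every scenario should land in exactly one bucket.
    assert sum(counts.values()) == total, "outcome counts must cover every scenario"
    return {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}


summary = {
    model: pass_rates(counts, TOTAL_SCENARIOS) for model, counts in results.items()
}
for model, rates in summary.items():
    print(model, rates)
```

By these counts, DeepSeek-v4-Flash fully passes about 85.5% of scenarios and Step-3.7-Flash about 79.7%. Both fail the same number (4), so the gap comes from partial results.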