One of the hardest parts of building self-improving agents is proving they are actually improving.
That’s why, alongside Duet Autopilot, we built DuetBench: the first benchmark designed specifically for CX agents that learn and improve over time.
To evaluate Duet Autopilot, we compared its performance against certified human agent builders, grading both on outcome and methodology across 90 diagnostic investigations, ranging from simple metric lookups to root-causing CSAT drops.
We also evaluated Autopilot on enterprise agent-building tasks. Starting from messy design documents, it had to build AOPs and tools from scratch, generate simulations, and pass every associated test before a task was considered complete.
Autopilot demonstrated an iterative approach to agent building. Rather than solving problems in a single pass, it ran simulations, identified broken branches, repaired the AOP or underlying tool, and repeated the process until the workflow passed.
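To make that loop concrete, here is a minimal Python sketch of a simulate-and-repair cycle. It is an illustration only, not Autopilot's implementation: the `Workflow` type, `run_simulations`, `repair_until_passing`, and the `fix_branch` repair step are all hypothetical names, and a "branch" is reduced to a simple pass/fail flag.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a workflow maps each branch name to True if it works.
Workflow = Dict[str, bool]


def run_simulations(workflow: Workflow, simulations: List[str]) -> List[str]:
    """Return the distinct branches that failed, in order of first failure."""
    failed: List[str] = []
    for branch in simulations:
        if not workflow.get(branch, False) and branch not in failed:
            failed.append(branch)
    return failed


def repair_until_passing(
    workflow: Workflow,
    simulations: List[str],
    repair: Callable[[Workflow, str], None],
    max_rounds: int = 10,
) -> int:
    """Simulate, repair failing branches, and repeat.

    Returns the number of simulation rounds run, including the final
    passing one. The workflow only counts as passing after a clean round.
    """
    for round_number in range(1, max_rounds + 1):
        failed = run_simulations(workflow, simulations)
        if not failed:
            return round_number
        for branch in failed:
            repair(workflow, branch)
    raise RuntimeError("workflow still failing after max_rounds")


def fix_branch(workflow: Workflow, branch: str) -> None:
    """Hypothetical repair step: mark the branch as working."""
    workflow[branch] = True


# Usage: two of three branches start out broken.
workflow = {"refund": True, "escalate": False, "cancel": False}
rounds = repair_until_passing(workflow, ["refund", "escalate", "cancel"], fix_branch)
```

In this toy run the first round finds the two broken branches and repairs them, and the second round confirms a clean pass, so `rounds` is 2. The point of the design is that a repair is never trusted until the simulations are rerun.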
Another notable result was that Autopilot improved the quality of its own test set through self-critique, increasing simulation accuracy from 58% to 88% across 520 benchmark runs.
As self-improving systems become more common, verified evaluation will matter just as much as model capability.
Excited to share the research behind it. Full writeup below. ↓