Uncommonroute trained router matches Claude Opus 4.6 in SWE-bench Verified evaluation.
TwinRouterBench tests routers on realistic agentic trajectories, and the results are staggering:
Uncommonroute Trained vs Claude Opus 4.6
75/100 vs 74/100 (matched in resolution)
$25.66 vs $54.73 (53% cost saving with Uncommonroute trained)
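As a quick check on the figures above, the saving follows directly from the two reported totals. The sketch below is only arithmetic on the numbers quoted in this post. The variable names are illustrative, and the cost per resolved task is a derived figure, not one reported by the benchmark.

```python
# Totals quoted above for the 100-case comparison.
routed_cost, routed_resolved = 25.66, 75   # Uncommonroute trained
opus_cost, opus_resolved = 54.73, 74       # Claude Opus 4.6

# Relative saving of the routed run against the single-model run.
saving = 1 - routed_cost / opus_cost

# Derived (not reported) figure: dollars spent per resolved task.
routed_per_resolved = routed_cost / routed_resolved
opus_per_resolved = opus_cost / opus_resolved

print(f"cost saving: {saving:.0%}")
print(f"cost per resolved task: ${routed_per_resolved:.2f} vs ${opus_per_resolved:.2f}")
```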
Advanced reasoning and agentic tasks have made models more expensive to run, so it is time to get the same quality at a better price.
TwinRouterBench is a new benchmark designed for step-level routing in long-horizon, multi-turn agentic workflows. Unlike traditional routing benchmarks, which focus on single-prompt routing, TwinRouterBench evaluates how well a "router" can choose the right model for each individual step of a complex task.
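To make the idea of step-level routing concrete, here is a minimal sketch in which a router sees only the prefix available before each step and picks a model tier for that step alone. Everything in it is an assumption for illustration: the tier names, the `Step` type, and the word-count heuristic are hypothetical and are not how the Uncommonroute router or TwinRouterBench work.

```python
from dataclasses import dataclass

# Hypothetical tier names, cheapest first (not the benchmark's locked pool).
TIERS = ["small", "medium", "large"]

@dataclass
class Step:
    prefix: str  # the context the router can see before this step runs

def route_step(step: Step) -> str:
    """Toy heuristic: more visible context gets a stronger tier."""
    n_words = len(step.prefix.split())
    if n_words < 20:
        return "small"
    if n_words < 200:
        return "medium"
    return "large"

def route_trajectory(steps: list[Step]) -> list[str]:
    """Make one routing decision per step, not one per task."""
    return [route_step(s) for s in steps]

trajectory = [
    Step("list the files in the repo"),   # 6 words
    Step("traceback line " * 30),         # 60 words
    Step("diff hunk context " * 100),     # 300 words
]
choices = route_trajectory(trajectory)
print(choices)
```

A single-prompt router would make one decision for the whole task. Here the same task receives three different decisions as its context grows.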
It uses a dual-track evaluation that balances fast development against realistic testing:
Track 1: Static Track (Fast Offline Track)
• 970 router-visible prefixes from 520 trajectory instances.
• Covers 5 diverse benchmarks: SWE-bench, BFCL, mtRAG, QMSum, and PinchBench.
• Each example comes with an execution-verified target tier (the cheapest sufficient model tier).
• Uses deterministic scoring based on tier correctness, trajectory membership, and token cost, so no LLM judges are needed.
Ideal for: training routers, rapid iteration, and cheap offline evaluation.
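The tier-correctness part of that scoring can be illustrated with a small deterministic check. This is a simplified sketch under stated assumptions, not the benchmark's scoring formula: the tier names and ranks are hypothetical, and trajectory membership and token cost are left out because this post does not say how they are weighted.

```python
# Hypothetical tier ranking, cheapest first.
TIER_RANK = {"small": 0, "medium": 1, "large": 2}

def tier_outcome(predicted: str, target: str) -> str:
    """Compare the routed tier with the cheapest sufficient tier."""
    diff = TIER_RANK[predicted] - TIER_RANK[target]
    if diff == 0:
        return "exact"   # cheapest sufficient tier chosen
    return "over" if diff > 0 else "under"  # wasteful vs. insufficient

def summarize(pairs):
    """Return the fraction of exact, over-routed, and under-routed choices."""
    counts = {"exact": 0, "over": 0, "under": 0}
    for predicted, target in pairs:
        counts[tier_outcome(predicted, target)] += 1
    total = len(pairs)
    return {name: count / total for name, count in counts.items()}

# (predicted tier, target tier) for four made-up prefixes.
pairs = [
    ("small", "small"),
    ("large", "medium"),
    ("small", "medium"),
    ("medium", "medium"),
]
summary = summarize(pairs)
print(summary)
```

Because the target tier is fixed ahead of time, the same inputs always produce the same score, which is what removes the need for an LLM judge.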
Track 2: Dynamic Track (Live Validation Track)
• Full evaluation harness on SWE-bench Verified (500 tasks).
• Reports results on a 100-case held-out split (disjoint from the static data).
• Router must choose a real model from a locked pool at every step.
• Measures real outcomes: official task resolution, actual API spend (real dollars), and failure penalties for unresolved tasks.
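The three outcome measures above can be combined in a simple report. The sketch below is illustrative only: the `TaskRun` type, the per-step costs, and especially the flat per-task `failure_penalty` are assumptions, since this post does not specify how the benchmark computes its penalties.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    resolved: bool            # did the task pass official resolution?
    step_costs: list[float]   # dollars spent at each routed step

def summarize_runs(runs: list[TaskRun], failure_penalty: float) -> dict:
    """Aggregate resolution count, API spend, and a penalized cost."""
    resolved = sum(r.resolved for r in runs)
    api_spend = sum(sum(r.step_costs) for r in runs)
    # Assumed penalty form: a flat charge per unresolved task.
    penalty = failure_penalty * (len(runs) - resolved)
    return {
        "resolved": resolved,
        "api_spend": api_spend,
        "penalized_cost": api_spend + penalty,
    }

# Made-up runs, not benchmark data.
runs = [
    TaskRun(True, [0.02, 0.10, 0.05]),
    TaskRun(False, [0.02, 0.30]),
    TaskRun(True, [0.01]),
]
report = summarize_runs(runs, failure_penalty=1.0)
print(report)
```

Charging for failures keeps a router from looking cheap simply by sending hard steps to models that cannot finish the task.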
By providing both a Static (fixed) and Dynamic (flowing) track, TwinRouterBench solves the problem where a router looks good on paper but fails when the agent actually has to live with its choices.
TwinRouterBench is built for the agentic era, where routing is measured at every step rather than in one-shot prompt tests. It addresses this realism distortion by testing routing within the actual context of multi-step, stateful agent trajectories.