We built Agent Judge to evaluate long-horizon agents.
As agents take on longer tasks, the evidence needed to evaluate them gets buried across tool calls, retries, logs, database updates, and final outputs.
Evaluating these agents requires investigating the trajectory, not just judging the final answer.