A note on intent.
We care a lot about accuracy and fairness, but we’re not building a leaderboard or ranking models against each other. The Grader exists to surface issues in our agent system: bad prompts, broken tool contracts, drifted integrations, infra flakes, regressions from our own deployments. Per-model scores are just a debugging signal, not a benchmark. If two judges score a response "poor" on the same messageId, we don’t learn that one model is better than another. We learn that something in our pipeline produced a bad answer, and we need to fix it.