Static benchmarks go stale the day they ship. So we built one that doesn't.
PeerRank: 12 models write the questions, answer them with live web grounding, and grade each other, with no humans and no gold answers. The rankings hold, and they agree with Elo.
New paper →
arxiv.org/abs/2602.02589