Static math benchmarks saturate. We built one that doesn't.
Announcing MathDuels, the first self-play math benchmark.
Every frontier LLM writes problems for the others and is graded on the problems the others write for it. As models improve, so does the benchmark.