Maybe my hopes were too high: I expected to find novel ideas that powered GDM’s IMO gold and would change my perspective, but I didn’t, and my earlier views largely stand. The biggest takeaway is that their ProofAutoGrader correlates strongly with expert graders. From a skim, the datasets look usable, though to my taste there are a few too many algebraic inequalities and computational geometry problems. Thanks to their “robustification,” they’re a decent starting point for evals. I’m actually surprised they made these datasets public; they even note long-term data contamination as a limitation, so the benchmark’s longevity may be limited. Also, their “hard” problems are hard in the time-limited IMO sense but don’t really resemble real-world hard problems.
Google DeepMind release:
Towards Robust Mathematical Reasoning
Introduces IMO-Bench, a suite of advanced reasoning benchmarks that played a crucial role in GDM's IMO-gold journey. Vetted by a panel of IMO medalists and mathematicians.
IMO-AnswerBench - a large-scale test on getting the right answer
IMO-ProofBench - a next-level evaluation for proof writing
IMO-GradingBench - a benchmark to enable further progress in automatic evaluation of long-form answers