People forget that when a new model, say Opus 4.5, improves by 5 percentage points on SWE-bench Verified, going from roughly 75% to 80%, it is in no way the same as a model going from 20% to 25%.
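One rough way to see the gap is to look at how much of the remaining failure each jump removes. This is only a sketch of that framing, not an official metric: the function name `relative_failure_reduction` is made up for illustration, the pass rates are the approximate ones above, and it assumes the same fixed task set with no regressions on tasks that were already passing.

```python
def relative_failure_reduction(old_pass_pct: float, new_pass_pct: float) -> float:
    """Share of the previously failing tasks that the new model now solves.

    Pass rates are given in percent (e.g. 75 for 75%).
    Assumption: same task set, and no previously solved task regresses.
    """
    old_fail = 100 - old_pass_pct
    new_fail = 100 - new_pass_pct
    return (old_fail - new_fail) / old_fail


# Same 5-point gain, very different share of the remaining failures removed.
print(f"75 -> 80: {relative_failure_reduction(75, 80):.2%}")  # 20.00%
print(f"20 -> 25: {relative_failure_reduction(20, 25):.2%}")  # 6.25%
```

So the same 5-point bump clears a fifth of what was still failing at 75%, versus about a sixteenth at 20%, and that is before accounting for the leftover tasks being harder.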
As a benchmark approaches saturation, all that's left are the absolute hardest tasks. That is why Opus 4.5 looks like an incremental improvement in charts, but offers drastically better performance than Opus 4.1 or Sonnet 4.5.
It's also why I think Anthropic's chart crimes, whilst kinda cringe, are not wholly unjustified.