This seems to mostly be "easy-to-medium tasks any model could do, but who implements things the most elegantly and idiomatically", and it makes sense that Opus 4.8 wins that. For "make this sprawling difficult complex change in this big codebase", I think GPT-5.5 will generally win.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40 hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?