To people in the answers saying "but opus 4.8 is weaker so without fallback, the score would even be higher": this is not necessarily true, because of how any benchmark works (it is an average over queries) and because of what is called "the fallacy of division".
Even if Opus 4.8 has a lower average score on AA than Fable 5, it actually performs better than Fable 5 on some of the benchmarks that compose the AA index, especially where Fable 5 has a high refusal rate (e.g. GPQA Diamond, AA-Omniscience). The same would hold for a single benchmark, btw, as it's always an average over queries, and a model having a higher average score doesn't mean it answers better on 100% of queries.
So it's possible that Fable with Opus 4.8 fallbacks is getting a higher score than pure Fable, even if Opus 4.8 is weaker on average.
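To make that concrete, here's a toy sketch with completely made-up per-query results (not real AA data, and not how any provider actually routes). The names `strong`, `weak`, and `with_fallback` are hypothetical. It assumes a refusal is scored as 0 and that the fallback only kicks in when the primary model refuses.

```python
# Hypothetical per-query outcomes: 1 = correct, 0 = wrong, None = refused.
# These numbers are invented purely to illustrate the averaging argument.
strong = [1, 1, 1, 1, 1, 1, None, None, 1, 0]  # better on average, but refuses twice
weak = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]          # worse on average, never refuses


def score(results):
    """Average score over queries, counting a refusal (None) as 0."""
    return sum(0 if r is None else r for r in results) / len(results)


def with_fallback(primary, fallback):
    """Use the primary model's result unless it refused, else the fallback's."""
    return [f if p is None else p for p, f in zip(primary, fallback)]


strong_score = score(strong)                       # 7 of 10 queries
weak_score = score(weak)                           # 5 of 10 queries
routed_score = score(with_fallback(strong, weak))  # 9 of 10 queries

print(f"strong alone:  {strong_score:.1f}")
print(f"weak alone:    {weak_score:.1f}")
print(f"with fallback: {routed_score:.1f}")
```

In this toy setup the weaker model scores lower overall, yet routing refusals to it lifts the combined score above the stronger model alone, because the weaker model happens to get right exactly the queries the stronger one refuses.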
The problem is that no one except the API provider knows, which is exactly the issue I'm pointing out.
More details below from Fable (or Opus?) themselves!
This graph captures what’s broken about AI evals: they structurally favor closed-source APIs that can route, fall back, ensemble, and optimize behind the scenes with no transparency.
No offense, @ArtificialAnlys, but how is comparing one model to two models fair?