Claude Fable 5 (Anthropic, released June 9, 2026) generally outperforms GPT-5.5 (OpenAI) across most of these benchmarks, with particularly large leads in coding, agentic software engineering, and complex reasoning tasks. The gap widens on harder/“frontier” subsets.
Fable 5 (the generally available version with safeguards) performs very close to the internal Mythos 5 preview in non-sensitive areas. Scores can vary slightly by harness, effort level (e.g., max/xhigh reasoning), and safeguards (which sometimes cause fallback to Opus 4.8 on cyber/biology-related tasks). Data comes from Anthropic’s launch materials, Epoch AI, Artificial Analysis, Vals AI, and third-party comparisons (as of mid-June 2026).
Here’s a benchmark-by-benchmark breakdown:
Math & Reasoning
•1. FrontierMath Tier 4 (research-level): Fable 5 ~87.8–88% (Epoch AI) vs GPT-5.5 ~72%. Strong Fable lead.
•2. FrontierMath Tier 1-3: Fable 5 ~87% vs GPT-5.5 ~85%. Slight Fable edge.
Coding & Software Engineering (Fable’s biggest strength)
•3. SWE-Bench Pro: Fable 5 80.3% vs GPT-5.5 58.6% (+21.7 points). Massive win for Fable.
•4. FrontierCode Diamond (hardest production-quality subset): Fable 5 29.3% vs GPT-5.5 5.7%. Huge lead (more than 5x).
•5. FrontierCode Main: Fable 5 ~46.3% vs GPT-5.5 ~25.5%. Clear Fable advantage.
•6. TerminalBench (2.1): Fable 5 84.3–88.0% (Mythos 5 scores higher; Fable 5 incurs some safety refusals) vs GPT-5.5 83.4%. Slight-to-moderate Fable edge.
•7. KernelBench Hard: Limited public head-to-head data. Fable 5 leads on the other complex coding and agentic benchmarks listed here, so a Fable advantage is likely, but that is an extrapolation rather than a measured result.
•31. LiveCodeBench: Fable 5 ~89.8% (top-ranked on Vals). No GPT-5.5 figure is cited here; a strong Fable lead is expected.
•34. IOI: Fable 5 72.25% (top on Vals).
•36. VibeCode: Fable 5 90.35% (top-ranked).
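To sanity-check the gap and ratio claims in the coding list above, here is a minimal Python sketch. It only re-does the arithmetic on the head-to-head figures quoted in this section; the `scores` dictionary and the `compare` helper are illustrative names, not part of any benchmark's tooling, and the FrontierCode Main values are the approximate ("~") ones given above.

```python
# Scores copied from the coding list above: (Fable 5 %, GPT-5.5 %).
# FrontierCode Main values are the approximate figures quoted in the text.
scores = {
    "SWE-Bench Pro": (80.3, 58.6),
    "FrontierCode Diamond": (29.3, 5.7),
    "FrontierCode Main": (46.3, 25.5),
}


def compare(fable: float, gpt: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, ratio of the two scores)."""
    return round(fable - gpt, 1), round(fable / gpt, 2)


for name, (fable, gpt) in scores.items():
    gap, ratio = compare(fable, gpt)
    print(f"{name}: +{gap} points, {ratio}x")
```

The two views tell different stories: SWE-Bench Pro and FrontierCode Diamond have similar absolute gaps, but the ratio is far larger on Diamond because GPT-5.5 starts from a much lower base.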
Agentic & Real-World Tasks
•9. Humanity’s Last Exam (No Tools): Fable 5 59.0% vs GPT-5.5 ~41–50% (sources vary).
•10. Humanity’s Last Exam (Tools): Fable 5 64.5% vs GPT-5.5 52.2%. Solid Fable win.
•15. AutomationBench: Fable 5 17.4% vs GPT-5.5 12.9%.
•16. OSWorld: Fable 5 85.0% vs GPT-5.5 78.7%.
•20. GDPval-AA: Fable 5 1932 vs GPT-5.5 1769. Clear Fable lead.
•21. GDPpdf (visual document reasoning, no tools): Fable 5 29.8% vs GPT-5.5 24.9%.
•22. Legal Agent Benchmark: Fable 5 13.3% vs GPT-5.5 2.1%. Very large Fable win.
•23. HealthBench (Professional variant): Fable/Mythos ~62.7–66%; GPT-5.5 trails in available comparisons.
•27. ALE-Bench (Agents’ Last Exam): GPT-5.5 has a slight edge in some harnesses (e.g., ~24% vs Fable ~22%). One of the few where GPT-5.5 competes or leads.
•28. Agent Arena: Fable leads in coding/research/document tasks per available reports.
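The GDPval-AA figures above (1932 vs 1769) are rating-style scores rather than percentages. Assuming they sit on a standard Elo scale (an assumption; the list above does not state the scale), the gap can be translated into an implied head-to-head win rate. The function name below is illustrative.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# GDPval-AA scores quoted above; treating them as Elo ratings is an assumption.
fable, gpt = 1932, 1769
print(f"Implied Fable 5 win rate: {expected_win_rate(fable, gpt):.1%}")
```

Under that assumption, the 163-point gap corresponds to Fable 5 being preferred in roughly 72% of pairwise comparisons. If the leaderboard uses a different scale, that figure does not apply.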
Broader Indices & Knowledge
•11. AAI Index (Artificial Analysis Intelligence Index): Fable 5 ~65 / 64.9 (often #1) vs GPT-5.5 60.
•29. Vals Index: Fable 5 75.14% (#1).
•30. Vals Multimodal: Fable 5 74.15% (#1).
•32. MMLU Pro: Fable 5 91.50% (#1 on Vals).
•33. MMMU: Fable 5 89.31% (#1 on Vals).
•35. CorpFin: Fable 5 71.83% (#1 on Vals).
•37. ProofBench: Fable 5 77.00% (#1 on Vals).
Other / Niche Benchmarks
•8. GBAEval, 12–13. WeirdML / Reliability, 14. PencilPuzzleBench, 17. Stagehand Agent Evals, 18. PACT Negotiation, 19. Debate Benchmark, 24. ExploitBench, 25. Cyber ECI, 26. FrogsGame, 38. Public Benefits Bench: Limited or no direct public head-to-head scores yet (Fable 5 is very new). Where related data exists, Fable generally leads on agentic, cyber, and coding tasks (e.g., the Mythos variant is strong on ExploitBench, though safeguards can lower Fable 5's scores on pure cyber tasks). A Fable advantage on most of the technical benchmarks in this group is an extrapolation from those patterns, not a measured result.
Overall verdict:
Claude Fable 5 is the stronger model on the vast majority of these benchmarks (especially anything involving long-horizon coding, production-quality software engineering, complex agentic workflows, or hard reasoning). The leads are often substantial on the hardest subsets (e.g., FrontierCod