GPT-5.5 (xHigh) ranks #2 on Agent Arena (10.6% net improvement), making it the highest-ranked OpenAI model, close behind Claude Fable 5 (High).
In the per-signal breakdown, GPT-5.5 (xHigh) ranks #1 in Praise vs. Complaint (29.4%) and Bash Recovery (14.1%), scoring higher than Claude Fable 5 (High) on both signals. It trails Claude Fable 5 (High) on Confirmed Success (5.4% vs. 17.6%) and Steerability (1.9% vs. 5.4%).
Agent Arena evaluates models on millions of real-world, long-horizon agentic tasks. Models use tools like web search, filesystem, and terminal to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
We use causal tracing to measure model performance across real-world agentic tasks. A fuller breakdown of GPT-5.5 (xHigh) across the five signals is in the thread.
Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandboxed environment, image generation, and file writing, and by asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks: GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.