I severely underestimated GPT-5.5. Here’s what changed.
Yesterday was a letdown. I ran one-shots, coding challenges, and benchmarks and walked away unimpressed. Wrote it off.
Today I ran everything again, this time through the direct API instead of Codex. The difference was night and day.
What flipped:
• One-shot prompts went from mediocre to genuinely impressive
• Coding tasks (pi and holefill) came back clean and sharp
• Benchmark scores jumped significantly; GPT-5.5 now leads in my testing
And I think I know why the gap existed.
Codex and the direct API aren’t the same thing. The context window differs, the model routing differs, and depending on how you authenticate, you may not even be hitting GPT-5.5 inside Codex at all. My “disappointing” day-one results were probably a configuration problem, not a model problem. My mistake.
When you actually get the real model? It’s a different story. It tore through my benchmark, including problems I designed to be extremely hard. I need to build harder tests.
Fixes I shipped along the way:
Two bugs were quietly distorting results for certain providers:
• Added automatic retry on dropped connections
• Removed the hard timeout cap
Turns out DeepSeek and Kimi K2.6 both wanted to think for over an hour on some prompts. Once I let them, their scores improved substantially. In my testing, Kimi now comes close to Sonnet 4.6 in quality, just considerably slower.
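For anyone running their own harness, here is a rough sketch of what those two fixes can look like together. This is not my actual benchmark code: call_with_retry, max_attempts, base_delay, and the sleep parameter are made-up names for illustration, and it assumes a dropped connection surfaces as Python's built-in ConnectionError. A real client library may raise its own exception type, so swap that in as needed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `request`, retrying only when the connection drops.

    There is deliberately no overall deadline here: a slow but healthy
    request is allowed to run for as long as it needs.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionError:
            # Assumption: dropped connections raise ConnectionError
            # (or a subclass such as ConnectionResetError).
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The point of the design is that only a broken connection triggers a retry. A model that is simply thinking for a long time is never cut off, which is what the old timeout cap was doing.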
What this means for my previous post:
Some of what I said was wrong. I’m owning that.
The updated picture is more nuanced: GPT-5.5 shows real strength in agentic coding and multi-step tasks in my runs. But this isn’t a clean sweep. Other models still hold ground in specific areas, and the race is still very much alive.
GLM 5.1, Gemma, and Grok are not yet updated.