🧵 on quantitative internal evals
1/4: GPT 5.5 excels at writing design docs efficiently. I turned some design docs I wrote recently into an eval. The prompt is a list of functional requirements to achieve within a codebase, and the design doc is graded based on full, partial, or missing adherence to each functional requirement.
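A minimal sketch of how a per-requirement rubric like this could be turned into a single score. The names `Adherence` and `score_design_doc` are hypothetical, and the 1.0 / 0.5 / 0.0 weights are an assumption; the thread doesn't say how full, partial, and missing grades are actually weighted.

```python
from enum import Enum


class Adherence(Enum):
    # Assumed weights; the thread does not say how grades are weighted.
    FULL = 1.0
    PARTIAL = 0.5
    MISSING = 0.0


def score_design_doc(grades: dict[str, Adherence]) -> float:
    """Average the per-requirement grades into a score between 0 and 1."""
    if not grades:
        raise ValueError("no requirements to grade")
    return sum(grade.value for grade in grades.values()) / len(grades)


# Hypothetical requirement names, for illustration only.
example_grades = {
    "requirement A": Adherence.FULL,
    "requirement B": Adherence.PARTIAL,
    "requirement C": Adherence.MISSING,
}
example_score = score_design_doc(example_grades)
```

Averaging treats every requirement as equally important; an eval that cares more about some requirements than others would need per-requirement weights.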
I’ve used GPT 5.5 for the past few weeks as an early tester, and it’s the biggest upgrade I’ve felt since o3.
GPT 5.5 is really smart for autoresearch-type jobs. It suggests tasteful research directions, implements the directions correctly, and works reliably for hours at a time. I used it to optimize our harness against an internal dataset, and eval results doubled.
I’ve been fighting some race conditions for the past few weeks. GPT 5.4, Opus 4.6, and Opus 4.7 all spun in circles for hours, playing whack-a-mole with issues while creating new ones. GPT 5.5 found the architectural issue, confirmed its approach with me, and then implemented the fix elegantly.
I use it as my daily driver for agentic coding because it actually finishes what it says it will do. When I tell it to fix a flaky test until the test passes 20 consecutive runs, GPT 5.5 works on it for over 24 hours and then succeeds. Opus has never done so faithfully.
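For reference, a rough sketch of what a "20 consecutive passes" acceptance check could look like. The helper names `run_until_stable` and `run_command_once` and the `max_attempts` cap are assumptions for illustration; the thread doesn't describe the actual test command or harness.

```python
import subprocess
from typing import Callable


def run_until_stable(
    run_once: Callable[[], bool],
    required_passes: int = 20,
    max_attempts: int = 1000,  # assumed safety cap, not from the thread
) -> bool:
    """Return True once run_once() has passed required_passes times in a row."""
    streak = 0
    for _ in range(max_attempts):
        if run_once():
            streak += 1
            if streak >= required_passes:
                return True
        else:
            # Any failure resets the streak: passes must be consecutive.
            streak = 0
    return False


def run_command_once(cmd: list[str]) -> bool:
    """Run a test command once; exit code 0 counts as a pass."""
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

With a real suite this might be called as `run_until_stable(lambda: run_command_once(["pytest", "path/to/test_file.py"]))`, where the path is a placeholder. Resetting the streak on every failure is what makes this stricter than "passes 20 times out of N".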