That's why I'm a fan of the GPT/Codex models
The Opus models seem "smarter" and more proactive for some tasks (like UI) and seem to hold more context
GPT 5.5, on the other hand, solves problems pragmatically and doesn't keep inventing workarounds or making assumptions about what it doesn't know
GPT 5.5 Codex with /goal is the first setup that turned my work 100% administrative, and I'm getting good at it. There are two outcomes:
→ A: my plan is good → GPT nails it
→ B: my plan is bad → GPT tries anyway → it fails but downplays the failure
And this is the catch: once you're used to detecting the apologetic "everything is fine" tone of case B, you can just revert and pivot to another idea. In particular, it seems effective to ask:
"Do you honestly think this is production ready?"
If there is anything smelly, GPT 5.5 will say so, allowing you to evolve towards a better design.
Concrete example:
> Asked it to implement an in-kernel GC for Bend2.
> It did exactly what I asked, and declared success.
> The code grew too much, so I got suspicious and asked:
> "Would you honestly merge that?"
> It then replied along the lines of:
> "Ehh... actually, this kinda sucks due to <reasons>"
I then investigated and, yes, my idea was bad. Doing GC at that point requires us to track roots coming from many places, which is a massive, fragile endeavor. It is just a bad approach.
So, I reverted it, chatted a bit more, then chose another path: use linearity information (which we already have) to free pattern-matched constructors, with small freelists.
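To make that idea concrete, here is a minimal sketch of freeing constructors at the point where they are pattern-matched, with one small freelist per node size. This is not Bend2's actual code: the `Heap` class, its methods, and the cell layout are all hypothetical names made up for illustration, and the sketch simply assumes the compiler has already proven the matched value is used exactly once.

```python
class Heap:
    """Toy heap (hypothetical): a flat list of cells plus one freelist per node size."""

    def __init__(self):
        self.cells = []   # backing store; a node is a tag cell followed by its fields
        self.free = {}    # node size -> list of free base addresses

    def alloc(self, size):
        # Reuse a previously freed block of the same size when one exists.
        bucket = self.free.get(size)
        if bucket:
            return bucket.pop()
        # Otherwise grow the heap (bump allocation).
        base = len(self.cells)
        self.cells.extend([None] * size)
        return base

    def release(self, base, size):
        # Push the block onto the freelist for its size class.
        self.free.setdefault(size, []).append(base)

    def make(self, tag, fields):
        # Allocate and initialize a constructor node.
        base = self.alloc(1 + len(fields))
        self.cells[base] = tag
        for i, field in enumerate(fields):
            self.cells[base + 1 + i] = field
        return base

    def match_linear(self, base, arity):
        # Destructure a node that is assumed to be linear (used exactly once).
        # Nobody else can reference it, so its cells are released right away:
        # no root tracking and no separate collection pass.
        tag = self.cells[base]
        fields = self.cells[base + 1 : base + 1 + arity]
        self.release(base, 1 + arity)
        return tag, fields


# Usage: swapping a pair reuses the cells of the matched node.
heap = Heap()
a = heap.make("Pair", [1, 2])
tag, fields = heap.match_linear(a, 2)
b = heap.make("Pair", [fields[1], fields[0]])
```

In this toy run the swapped pair lands in the same cells the matched pair occupied, so the heap does not grow. The point of the design is that the free happens exactly where the match happens, which is why no roots need to be tracked.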
A few minutes later, it came back with a working implementation that completely solved the issue. I asked again:
> "Would you honestly merge that?"
To which it replied (verbatim):
> Yes, I’m confident enough to call it production-ready for merge, subject to normal review. Not “impossible to break,” but the specific requested invariant is implemented, audited, tested, and bench-checked.
When GPT uses that kind of tone, it is very, very likely that whatever it did is actually solid, robust and trustworthy, and it knows it. I still inspected the code manually and, indeed, it is solid. That doesn't mean fail-proof, but it is certainly *much* better than before, and mergeable.
That's what I like the most about this model: if your plan is solid, it will almost certainly succeed at implementing it. It only fails if the plan itself is poor, which will cause it to desperately attempt to fit a circle in a square, fail, and be too embarrassed to admit it. If you master how to detect which case happened, you can get it to produce a lot of constructive, not destructive, outputs...