After testing GPT-5 (Pro subscription) since launch on real work (coding, research, prod reviews), here’s a straight, no-fluff take.
TL;DR
GPT-5 Thinking is the best “facts + web synthesis” model I’ve used so far. GPT-5 Pro feels like a staff/principal engineer doing risk reviews. Codex CLI is now the default implementer; Claude Code is the reviewer. Hallucinations are near-zero when the models are used correctly. Not a silver bullet; still needs tests and discipline.
Precision & Web
•Retrieves, verifies, and compresses web info with very low hallucination rate.
•Better than “Deep Research”-style flows tried before: faster to the point, more signal per token, fewer detours.
•o3 was already elite at reasoning; GPT-5 Thinking is o3+, with deeper nuance and tighter sourcing.
Model Routing
•The picker/router was confusing early on, so GPT-5 Thinking is the default for anything non-trivial.
•The non-thinking variant is only used for universally known facts (“Who was Marcus Aurelius?”).
•Non-thinking struggled on math/physics stress prompts (5 fails on a standard test prompt used across models). It’s not the right tool for formal derivations.
GPT-5 Pro = Production Guardian
•Workflow: build an implementation plan with Claude Code + GPT-5 via Codex CLI (after giving repo context) → hand the plan to GPT-5 Pro.
•What happens: GPT-5 Pro spots production-grade failures before they happen: race conditions, idempotency gaps, edge-case input handling, flaky retries, concurrency pitfalls, security regressions.
•The difference: not generic “lint”; it flags the exact line of failure and the real-world blast radius (e.g., webhook replay + partial DB commit = phantom charges). That’s principal-engineer-level scrutiny.
•Hallucinations were effectively zero in these reviews; citations and reasoning held up under adversarial checks.
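To make the “webhook replay + partial DB commit” failure concrete, here is a minimal sketch of one common guard: record the event ID and the charge in a single transaction, so a replay is rejected and a crash between the two writes leaves nothing behind. The table names, `handle_webhook`, and the `fail_midway` switch are all hypothetical and exist only for illustration; this is not taken from any actual review output.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical tables for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (event_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def handle_webhook(conn, event_id, amount_cents, fail_midway=False) -> bool:
    """Apply a charge at most once per event_id.

    Returns True if the charge was applied, False if the event was a replay.
    fail_midway simulates a crash between the two writes (illustration only).
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # The primary key makes a replayed event_id raise IntegrityError.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            if fail_midway:
                raise RuntimeError("simulated crash between the two writes")
            conn.execute(
                "INSERT INTO charges (event_id, amount_cents) VALUES (?, ?)",
                (event_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: ignore the replay


def total_charged(conn) -> int:
    """Sum of all committed charges, in cents."""
    row = conn.execute("SELECT COALESCE(SUM(amount_cents), 0) FROM charges").fetchone()
    return row[0]
```

Because both inserts share one transaction, the simulated crash rolls back the event marker too, so a later retry of the same event can still succeed, while a replay of an already committed event is a no-op.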
Codex CLI vs Claude Code
•Codex CLI is slower than Claude Code at times and can feel conservative, but it’s more solid and avoids nonsensical diffs.
•Best pattern found: Codex CLI as the implementer, Claude Code as the second-opinion reviewer focused on clarity and refactors. Net effect: fewer regressions, cleaner merges.
Props to @embirico for keeping in touch with the community and @OpenAI for giving subscription usage instead of just API. This product has improved significantly in such a short period of time.
Where GPT-5 Thinking Shines
•Web-backed briefs, competitive scans, RFC-style design notes, failure-mode analysis, and “compress the internet into what matters” tasks.
•It consistently catches the subtle stuff o3 sometimes missed and keeps the write-ups crisp.
Limitations & Caveats
•Non-thinking ≠ math engine; use Thinking/Pro for formal reasoning or back it with a CAS/test harness.
•Speed can vary; don’t block delivery on a single long run. Instead, stage the work and keep tests green.
•Never outsource judgment: enforce idempotency, add invariants, run chaos/replay tests, and treat outputs as proposals until the CI proves them.
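As one way to “back it with a test harness”, the sketch below checks a model-proposed derivative numerically against a central finite difference at random points. The helper names, the sample function, and the tolerances are assumptions chosen for illustration; a real CAS would give a symbolic check instead, and a numeric check like this can only reject a wrong answer, not prove a right one.

```python
import math
import random


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x via a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)


def derivative_matches(f, claimed_df, samples=200, lo=-3.0, hi=3.0, tol=1e-6, seed=0):
    """Return True if claimed_df agrees with the numerical derivative of f
    at every sampled point in [lo, hi], within the given tolerance."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        expected = central_difference(f, x)
        if not math.isclose(claimed_df(x), expected, rel_tol=tol, abs_tol=tol):
            return False
    return True


# Hypothetical example: f(x) = x * sin(x) and two candidate derivatives.
def f(x):
    return x * math.sin(x)


def correct_df(x):
    return math.sin(x) + x * math.cos(x)  # product rule applied correctly


def wrong_df(x):
    return math.sin(x) - x * math.cos(x)  # sign error of the kind to catch
```

Calling `derivative_matches(f, correct_df)` passes while `derivative_matches(f, wrong_df)` fails, which is the sort of cheap gate that can sit in CI before a derivation is trusted.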
Verdict
This is the first time an LLM actually felt like a staff/principal engineer on call 24/7. For this use case, shipping reliable software with real stakes, GPT-5 is an upgrade over o3 in depth, subtlety, and truthfulness. Expectations exceeded.