WOW I did not expect these results. This is actually crazy, insightful, and completely changes my dev workflow moving forward:
A SINGLE CODEX /goal RUN IS THE CLEAR WINNER. NO ORCHESTRATION, NO OUROBOROS, JUST ONE LITTLE AGENT THAT COULD 🤯
IT COMPLETELY DESTROYED THE OPUS ORCHESTRATOR IN SPEED AND QUALITY!
Before I went to sleep, Codex 5.5 xhigh finished 1 hour in!
Full migration done, everything clean. I reviewed the PR and I am very happy.
Claude Code (Opus 4.7) had already been working for 5 hours by the time I went to bed. I woke up, and it still wasn't done! 13 hours! And it had only stopped because it paused to ask me an irrelevant question.
Orchestration has never taken this long for me in the past. I'm using the new CC /goal mode and auto-compacting at 25% (250k context) to prevent context rot past that point.
It is STUPID SLOW (which is funny bc it's managing GPT 5.5 low, fast-mode, so it shouldn't take THAT long)
for what ended up being LOWER quality work! By a mile!
This was really surprising to me, because before 5.5 came out Orchestrating like this was the absolute best, fastest and most efficient.
And now on a large critical task, it was more than 6x slower than a single 5.5 /goal mode instance on xhigh ???
It seems compaction played a large role in the slowdown here, because Claude Code compacts at 25% (250k tokens) automatically (I set this in settings).
Every time it compacts, it has to take the time to READ EVERYTHING to get the full context back, then execute, then fill up again, then compact again, and oh boy it's not efficient at all.
In fact, most of its time as the orchestrator was spent compacting and reading context, then compacting again!
Meanwhile Codex just had one long, continuously running compaction and kept moving forward. I believe my goal ledger skill plays a big role in helping it stay aligned here!
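To put rough numbers on why compact-then-re-read can eat so much time, here's a toy model, not a measurement. The 250k threshold is the one from my settings; the amount re-read after each compaction and the total amount of real work are made-up values purely for illustration, and the function name is hypothetical.

```python
import math

def compaction_overhead(work_tokens, threshold, reread_tokens):
    """Toy model: each cycle starts by re-reading `reread_tokens` of context,
    then does useful work until the context hits `threshold` and compacts."""
    useful_per_cycle = threshold - reread_tokens
    if useful_per_cycle <= 0:
        raise ValueError("re-reading alone fills the context; no work gets done")
    # Number of compact/re-read cycles needed to get through all the real work.
    cycles = math.ceil(work_tokens / useful_per_cycle)
    overhead_tokens = cycles * reread_tokens
    # Share of all processed tokens that was spent rebuilding context.
    overhead_share = overhead_tokens / (work_tokens + overhead_tokens)
    return cycles, overhead_share

# Hypothetical inputs: 1M tokens of real work, 150k tokens re-read per cycle.
cycles, share = compaction_overhead(1_000_000, 250_000, 150_000)
print(cycles, round(share, 2))  # prints: 10 0.6
```

With those assumed inputs, more than half of everything the orchestrator processes is just rebuilding context, which matches what it felt like watching it.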
Look at this difference LMFAO:
- Codex PR #23: backend Supabase removal complete, canonical wake wired, preserved surfaces intact, typecheck/lint/tests green, dogfooded against local Postgres, one item correctly deferred and documented. Mergeable now. +4,056/−981.
- Claude attempt-1: fails the headline goal (supabase dir 9 importers still present), regressed a preserved surface (gutted task.service, stubbed tasks.router to emptyBoard — PRD-forbidden), deleted ~5,456 test lines, uncommitted/dirty. The 17,762 deletions are over-deletion, not more work.
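The headline-goal failure above (importers of the supabase dir still present) is the kind of thing you can check mechanically. A minimal sketch, assuming a TypeScript codebase where the old code is imported through a path containing "supabase"; the file extensions, the regex, and the function name are my assumptions for illustration, not the actual check used in the review.

```python
import re
from pathlib import Path

# Matches `from "...supabase..."`, `import "...supabase..."`, `require("...supabase...")`.
# Assumption: the import path itself contains the word "supabase".
IMPORT_RE = re.compile(
    r"""(?:from|import|require\()\s*['"]([^'"]*supabase[^'"]*)['"]"""
)

def find_supabase_importers(root, exts=(".ts", ".tsx")):
    """Return source files under `root` that still import from a supabase path."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        if "node_modules" in path.parts:
            continue  # skip third-party code
        text = path.read_text(encoding="utf-8", errors="ignore")
        if IMPORT_RE.search(text):
            hits.append(path)
    return hits

# Usage (hypothetical path): an empty list would mean the removal goal is met.
# for p in find_supabase_importers("apps/backend/src"):
#     print(p)
```

A check like this is cheap enough to run before calling a migration branch "done".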
Wow. I am actually shocked. I am so happy I ran two diff workflows on a big, identical PERSONAL problem.
This completely changes my workflow moving forward: no longer will I orchestrate a big task from the top down.
Instead, I am going to now experiment with the following flow on Codex:
1. Having Codex scope our codebase, then having a back-and-forth brainstorming/discussion on what needs to be done
2. Creating a master PRD from that file, and SPLITTING the work into focused branch work
3. Branching off the chat in parallel, until we get to a part where we need to merge work, then parallelize again
This way, Codex agents can work individually, every single branch will have the same research/brainstormed context, and they just work to full completion
Based off this experience, this feels like the right direction. I will never do an orchestrator in this style again (executing a PRD to completion). Instead, I will do more of... a manager of branched work.
Regardless of what I do moving forward, I will never run an orchestrator setup like this again. LMFAO
OK FIRST EVAL: CODEX RUNNING /goal
VS.
CLAUDE CODE ORCHESTRATING CODEX AGENTS
I have an ACTUAL long-form task I have to finish. I created two separate worktrees.
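For anyone wanting to copy the setup, here's a rough sketch of creating one worktree per contender with `git worktree add`. The branch prefix `eval/`, the sibling-directory layout, and the contender names are placeholders I made up for illustration; the function only builds the commands so you can look at them before running anything.

```python
from pathlib import Path

def worktree_commands(repo, base_branch, names):
    """Build one `git worktree add` command per contender.

    Each worktree gets its own new branch (hypothetical `eval/<name>` naming)
    and lives in a sibling directory next to the main checkout.
    """
    repo_path = Path(repo).resolve()
    cmds = []
    for name in names:
        target = repo_path.parent / f"{repo_path.name}-{name}"
        cmds.append([
            "git", "-C", str(repo_path),
            "worktree", "add",
            "-b", f"eval/{name}",   # new branch for this run
            str(target),            # where the worktree is checked out
            base_branch,            # both runs start from the same commit
        ])
    return cmds

for cmd in worktree_commands(".", "main", ["codex-goal", "claude-orchestrator"]):
    print(" ".join(cmd))
```

To actually create them, pass each command to `subprocess.run(cmd, check=True)`. Starting both branches from the same base is what keeps the comparison fair.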
This one is a full migration of services from Supabase to self-hosted Postgres instead, dogfooded, e2e tested
I am curious if Codex (NOT orchestrating subagents, but doing work itself as a single agent) on xhigh will perform better than Claude Code (Opus 4.7, high) orchestrating an army of Codex Agents (5.5 low)
I'll be judging these based on
- did you do the thing i actually wanted
- how long did it take
- how much did it cost
- which output is higher quality
I never had an incentive to do this because the best workflows were obvious, but now they're not and I feel lost again 😭
Will run this overnight and see which one does best and report results !