Spent the last few days tuning Hermes with Codex / GPT-5.
Not by adding one magic prompt, but by inspecting real failed conversations: token blowups, tool loops, fake “still working” replies, missed CLI paths, memory vs skill confusion.
The pattern became clear: the model matters, but the agent runtime matters just as much. A smart model can still waste 40k tokens if the system lets it drift; a weaker model can look much better if the workflow gives it sharp recovery rails.
We patched Hermes to detect debugging drift, repeated tool failures, bad handoffs, stopped “I’m processing” replies, OpenCLI path recovery, and unverified skill claims.
It feels less like “prompt engineering” and more like raising an agent: watch what it actually does, catch the bad habits, turn the lessons into runtime behavior, then test again.
Still messy. But it’s getting smarter in the only way that counts: from real scars.