Made a slop diagram to show something benchmarks have never captured: a smarter model tends to pick better first steps in an agent trajectory, which can lead to an entirely different outcome.
I wonder if GitHub would save money and have fewer outages if they put a cheap model in front of every GitHub Action that gated execution on: "does this look necessary"?
We've got AGI, but the Google Home app can't change the temperature in your home without first showing a stale UI state, ignoring your initial taps, then latently processing maybe 1 of them.
It's not necessarily bad, but it invariably becomes vibe coding. Harnesses like Codex, Claude Code, and OpenCode are all abstractions that hide signals humans need to evaluate code.