> be anthropic engineer
> realize long-running agents still have goldfish memory
> every new context window = new intern who forgot everything from yesterday
> project goes from “build a clone of ChatGPT” to “why is half the frontend missing again?”
> agents try to one-shot entire apps
> run out of context mid-feature
> next session wakes up like “boss… who touched the router folder? why is the server on fire?”
> other times claude walks in
> sees 3 buttons rendered
> declares the whole project complete
> packs up its laptop
> goes home
> humans don’t work like this
> engineers leave breadcrumbs
> notes, git commits, tests, todo lists
> “here’s what I did, here’s what’s next, don’t break the login page again please”
> so anthropic builds a harness based on that
> two-agent setup: initializer agent, coding agent
> initializer agent = the senior dev on day one
> sets up:
> – _init.sh
> – claude-progress.txt
> – feature_list.json (200 features, all marked failing)
> – the first git commit
> basically: “here’s the blueprint, don’t get cute”
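> rough sketch of what that blueprint file could look like, in python
> field names (id, description, passes) and both helper names are made up for illustration, not taken from the actual harness
> the one rule that matters: every feature starts out failing

```python
import json
from pathlib import Path


def build_feature_list(descriptions):
    """Return one entry per feature, each starting out as failing.

    The keys "id", "description" and "passes" are assumed names
    chosen for this sketch.
    """
    return [
        {"id": number, "description": text, "passes": False}
        for number, text in enumerate(descriptions, start=1)
    ]


def write_feature_list(path, descriptions):
    """Write the feature list as JSON and return the in-memory copy."""
    features = build_feature_list(descriptions)
    Path(path).write_text(json.dumps(features, indent=2), encoding="utf-8")
    return features
```

> e.g. write_feature_list("feature_list.json", ["user can open a new chat", "user can send a message"])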
> coding agent = the worker bee
> every session:
> – read the progress
> – read the git log
> – read the feature list
> – pick ONE feature
> – implement it
> – test it end-to-end as an actual user
> – commit code
> – leave notes
> – do NOT break anything, or revert yourself
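> the "pick ONE feature" part could be as small as this
> a sketch only: assumes each feature is a dict with a boolean "passes" key (hypothetical schema), and both function names are invented here
> a feature only flips to passing when the end-to-end check actually succeeded

```python
def pick_next_feature(features):
    """Return the first feature that is still failing, or None if all pass."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def record_result(feature, e2e_ok):
    """Mark the feature as passing only if the end-to-end check succeeded.

    A failed check leaves the feature untouched, so it gets picked
    again next session.
    """
    if e2e_ok:
        feature["passes"] = True
    return feature["passes"]
```

> one feature per session, and nothing gets marked done on vibes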
> incremental progress > chaos
> and forcing the agent to act like a real engineer = night and day difference
> testing was the big “aha”
> claude kept marking features done that absolutely were not done
> (“unit tests pass” ≠ “the app works”)
> give it browser automation (puppeteer)
> claude suddenly starts catching bugs it introduced 5 minutes ago
> screenshots, clicks, actual user flows
> end-to-end or bust
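> what "actual user flows" means in code, roughly
> a sketch under assumptions: the steps here are plain python callables standing in for real browser actions (no puppeteer calls shown), and run_user_flow is a made-up name
> the flow fails at the first step that fails or throws

```python
def run_user_flow(steps):
    """Run (name, action) steps in order and stop at the first failure.

    Each action is a zero-argument callable returning a truthy value on
    success. In a real harness the actions would drive a browser; here
    they are stand-ins so the sketch stays runnable.

    Returns (True, None) if every step succeeded, otherwise
    (False, name_of_failed_step).
    """
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            # A crashing step counts as a failed step, not a crashed test run.
            ok = False
        if not ok:
            return False, name
    return True, None
```

> the name of the failed step is the useful part: it tells the next session where the flow broke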
> limitations still there
> claude can’t see browser-native alert modals through puppeteer
> claude can’t see everything
> vision quirks remain
> but it’s way closer to real QA than “lol curl localhost:3000”
> typical session now looks like:
> “pwd”
> read progress
> read features
> read git log
> start server
> sanity test
> fix broken stuff
> choose next feature
> implement
> test
> commit
> leave breadcrumbs
> repeat
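> the "leave breadcrumbs" and "read progress" ends of that loop, sketched in python
> assumptions, loudly: the DONE/NEXT entry layout, the blank-line separator, and both function names are invented for this example; only the idea of a progress file comes from the post
> entries are append-only so a new session can read the last few and get its bearings

```python
from datetime import datetime, timezone
from pathlib import Path


def append_progress_note(path, done, next_up):
    """Append one timestamped DONE/NEXT entry to the progress file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"[{stamp}]\nDONE: {done}\nNEXT: {next_up}\n\n"
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


def read_recent_notes(path, max_entries=3):
    """Return the most recent entries, oldest first.

    Entries are separated by a blank line, so this assumes the note
    text itself contains no blank lines.
    """
    file = Path(path)
    if not file.exists():
        return []
    entries = [e for e in file.read_text(encoding="utf-8").split("\n\n") if e.strip()]
    return entries[-max_entries:]
```

> e.g. append_progress_note("claude-progress.txt", "login form works end-to-end", "logout button")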
> four classic failure modes solved with structure:
> – agent declaring victory too early → feature list
> – messy environment → git + progress logs
> – premature ‘passes’ → real testing
> – agent forgot how to run app → _init.sh
> does it solve everything? no
> still open questions:
> single agent vs multi-agent division of labor
> maybe future = dedicated QA agent, cleanup agent, test writer agent
> maybe research workflows get similar scaffolding
> maybe finance models get their own version
> but the core insight stands:
> long-running agents don’t fail because they’re dumb
> they fail because we throw them into multi-session hell
> without giving them the engineering rituals humans rely on
> give them structure, tools, tests, logs, diffs
> they stop acting like goldfish
> and start acting like teammates
New on the Anthropic Engineering Blog: Long-running AI agents still face challenges working across many context windows.
We looked to human engineers for inspiration in creating a more effective agent harness.
anthropic.com/engineering/ef…