from weights → context → harness engineering
(evolution of the agent landscape, 2022-26)
the biggest shift in AI agents had nothing to do with making models smarter.
it was about making the environment around them smarter.
here's how agent engineering evolved in just 4 years, across three distinct phases:
𝗽𝗵𝗮𝘀𝗲 𝟭: 𝘄𝗲𝗶𝗴𝗵𝘁𝘀 (𝟮𝟬𝟮𝟮)
everything was about the model itself. bigger models, more data, better training. scaling laws told us that progress = more parameters, more data, more compute.
RLHF and fine-tuning shaped behavior. if you wanted a better agent, you trained a better model.
this worked great for single-turn tasks. ask a question, get an answer.
but it hit a wall fast. updating one fact meant retraining. auditing behavior was nearly impossible. and personalization across millions of users from one frozen set of weights? not happening.
𝗽𝗵𝗮𝘀𝗲 𝟮: 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 (𝟮𝟬𝟮𝟯-𝟮𝟬𝟮𝟰)
the realization: you don't always need to change the model. you can change what the model sees.
prompt engineering, few-shot examples, chain-of-thought, RAG. suddenly the same frozen model could behave completely differently based on what you put in front of it.
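a minimal sketch of that idea, assuming a toy keyword retriever and made-up documents (DOCS, FEW_SHOT, retrieve, and build_prompt are all hypothetical names, not any library's API). the model never changes here; only the prompt assembled in front of it does:

```python
from typing import List

# hypothetical knowledge base and few-shot examples, for illustration only
DOCS = [
    "refunds are processed within 5 business days.",
    "support is available monday through friday.",
    "shipping is free for orders over 50 dollars.",
]
FEW_SHOT = [
    ("what is your return window?", "returns are accepted within 30 days."),
]

def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """toy retriever: rank docs by how many words they share with the query."""
    q_words = set(query.lower().replace("?", "").split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.rstrip(".").split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    """assemble instructions, few-shot examples, retrieved facts, and the question."""
    parts = ["answer using only the facts provided."]
    for q, a in FEW_SHOT:
        parts.append(f"Q: {q}\nA: {a}")
    for fact in retrieve(query, DOCS):
        parts.append(f"fact: {fact}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# this string is what a frozen model would receive
prompt = build_prompt("how long do refunds take?")
print(prompt)
```

a real pipeline would likely swap the keyword overlap for embedding search, but the shape stays the same: retrieve, assemble, send.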
many developers started iterating on prompts and retrieval pipelines before reaching for fine-tuning. it was cheaper, faster, and surprisingly effective.
but context windows are finite. long prompts get noisy. models attend unevenly (the "lost in the middle" problem is real). and every new session starts fresh with zero memory of what happened before.
context made agents flexible. it didn't make them reliable.
𝗽𝗵𝗮𝘀𝗲 𝟯: 𝗵𝗮𝗿𝗻𝗲𝘀𝘀 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 (𝟮𝟬𝟮𝟱-𝟮𝟬𝟮𝟲)
this is where we are now, and the shift is fundamental.
the question changed from "what should we tell the model?" to "what environment should the model operate in?"
the model is no longer the sole location of intelligence. it sits inside a harness that includes persistent memory, reusable skills, standardized protocols (like MCP and A2A), execution sandboxes, approval gates, and observability layers.
the model stays the same. what changes is the task it's being asked to solve: there is less to hold in its head at each step.
a concrete example: a coding agent asked to implement a feature, run tests, and open a PR.
without a harness, the model must keep repo structure, project conventions, workflow state, and tool interactions all inside a fragile prompt.
with a harness, persistent memory supplies context, skill files encode conventions, protocolized interfaces enforce correct schemas, and the runtime sequences steps and handles failures.
same model. completely different reliability.
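roughly what that division of labor could look like, as a sketch under stated assumptions: the model is a stub function, the tool names and TOOL_SCHEMAS are invented for illustration, and tools are only validated and recorded, not actually executed. the harness owns memory, schema checks, sequencing, and retries:

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

# hypothetical tool schemas: required argument names per tool
TOOL_SCHEMAS: Dict[str, List[str]] = {
    "run_tests": ["path"],
    "open_pr": ["title", "branch"],
}

def validate_call(call: dict) -> None:
    """reject tool calls that don't match the declared schema."""
    tool = call.get("tool")
    if tool not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {tool}")
    missing = [a for a in TOOL_SCHEMAS[tool] if a not in call.get("args", {})]
    if missing:
        raise ValueError(f"{tool} missing args: {missing}")

def run_harness(model: Callable, steps: List[str], memory_path: Path,
                skills: str, max_retries: int = 2) -> dict:
    """sequence steps, feed validation errors back to the model, persist progress."""
    memory = json.loads(memory_path.read_text()) if memory_path.exists() else {"done": []}
    for step in steps:
        if step in memory["done"]:
            continue  # resume: skip steps finished in an earlier session
        error: Optional[str] = None
        for _ in range(max_retries + 1):
            call = model(step, skills, error)
            try:
                validate_call(call)
            except ValueError as exc:
                error = str(exc)  # retry with the error as feedback
                continue
            memory["done"].append(step)
            memory_path.write_text(json.dumps(memory))  # persistent memory
            break
        else:
            raise RuntimeError(f"step failed after retries: {step}")
    return memory

def stub_model(step: str, skills: str, error: Optional[str]) -> dict:
    """stand-in for an LLM: forgets 'branch' until the harness reports the error."""
    if step == "run_tests":
        return {"tool": "run_tests", "args": {"path": "tests/"}}
    args = {"title": "add feature"}
    if error:
        args["branch"] = "feature/x"
    return {"tool": "open_pr", "args": args}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "memory.json"
    first = run_harness(stub_model, ["run_tests", "open_pr"], path, skills="use pytest")
    second = run_harness(stub_model, ["run_tests", "open_pr"], path, skills="use pytest")
```

the second run does no work: the memory file already records both steps, so a fresh session picks up where the last one stopped.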
𝘁𝗵𝗲 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 𝗮𝗰𝗿𝗼𝘀𝘀 𝗮𝗹𝗹 𝘁𝗵𝗿𝗲𝗲 𝗽𝗵𝗮𝘀𝗲𝘀 𝗶𝘀 𝘀𝗶𝗺𝗽𝗹𝗲:
- weights encoded knowledge in parameters (fast but rigid)
- context staged knowledge in prompts (flexible but ephemeral)
- harnesses externalized knowledge into persistent infrastructure (reliable and governable)
no phase replaced the previous one. each layered on top. weights still matter. context engineering still matters. but the center of gravity has moved outward.
the most consequential improvements in agent reliability today rarely come from changing the base model.
they come from better memory retrieval, sharper skill loading, tighter execution governance, and smarter context budget management.
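context budget management is the easiest of these to sketch. a minimal version, assuming a crude word count in place of a real tokenizer and hand-assigned priorities (count_tokens and pack_context are hypothetical helpers):

```python
from typing import List, Tuple

def count_tokens(text: str) -> int:
    """crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())

def pack_context(items: List[Tuple[int, str]], budget: int) -> List[str]:
    """greedily keep the highest-priority items that still fit in the budget."""
    chosen: List[str] = []
    used = 0
    for _, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen

# (priority, text) pairs; higher priority is packed first
items = [
    (3, "project uses pytest and black"),
    (1, "changelog from two years ago with many old entries"),
    (2, "open task: add retry logic"),
]
packed = pack_context(items, budget=10)
print(packed)
```

greedy packing is one simple policy among many; a production harness might summarize or compress low-priority items instead of dropping them.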
building better agents increasingly means building better environments for models to operate in.
there's a great paper on this:
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
paper: arxiv.org/abs/2604.08224
i also published this deep dive (article) on agent harness engineering, covering the orchestration loop, tools, memory, context management, and everything else that transforms a stateless LLM into a capable agent.