a bunch here where I’m saying ok Garry’s kinda right?! 👀…in some ways :) we’re making this loop much easier to close out of the box soon
If more people get into evals & traces to ground self-improving agents from Garry’s posts, there’ll be no one happier than me
have written about this at length so will also share some linked materials for anyone (including your Clanker) who wants to dig into more details of building evals & self-improving agent systems:
Traces & Evals are the lifeblood of agent improvement loops
We point compute at traces so we can classify what agents did wrong. Yes, but the hard part is figuring out what the error even was and how to fix it in a way that actually generalizes over time (not playing whack-a-mole with if-else statements all the time). Is our agent a bad long-horizon planner for X tasks? Should we change the model, add better planning instructions, or use subagents to isolate context because these types of tasks bloat the main window?
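Rough sketch of the "classify, then look for the pattern" idea. Everything here is made up for illustration: the trace tuples, the step threshold, and the rule-based `label` function (in practice that step might be an LLM classifier). The point is just that you group failures before you fix them, so you patch the pattern and not the single trace.

```python
from collections import Counter

# Hypothetical trace summaries: (task_type, number_of_steps, succeeded)
traces = [
    ("refactor", 42, False),
    ("refactor", 38, False),
    ("bugfix", 6, True),
    ("refactor", 9, True),
    ("bugfix", 7, False),
]

LONG_HORIZON_STEPS = 30  # assumed threshold for what counts as "long horizon"


def label(steps, succeeded):
    """Toy stand-in for a failure classifier."""
    if succeeded:
        return "ok"
    if steps >= LONG_HORIZON_STEPS:
        return "long_horizon_failure"
    return "short_task_failure"


# Count (task_type, label) pairs so recurring failure modes stand out.
failure_counts = Counter(
    (task_type, label(steps, ok)) for task_type, steps, ok in traces
)

for key, count in failure_counts.most_common():
    print(key, count)
```

If one (task type, failure label) pair dominates, that's a hint the fix belongs at the planning or architecture level and not in one more if-else.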
Evals encode the behavior we want agents to have in production. Generating evals from traces is how we measure the changes we’re making over time. This is why we lean so hard into Tracing & Evals tooling with LangSmith (more coming soon on making this loop even easier!).
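A minimal sketch of what "generating an eval from a trace" can mean, assuming a very simplified trace shape. `Trace`, `EvalCase`, and `eval_from_failed_trace` are hypothetical names for illustration only, not LangSmith's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """Hypothetical, simplified trace: the task plus the tool calls the agent made."""
    task: str
    tool_calls: list[str]


@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[Trace], bool]  # True when the behavior is what we want


def eval_from_failed_trace(trace: Trace, name: str, forbidden_tool: str) -> EvalCase:
    """Freeze a failure seen in production into a reusable regression check."""
    def check(candidate: Trace) -> bool:
        return forbidden_tool not in candidate.tool_calls

    return EvalCase(name=name, task=trace.task, check=check)


# A production trace where the agent deleted a file it should have edited.
bad = Trace(task="rename a config key", tool_calls=["read_file", "delete_file"])
case = eval_from_failed_trace(bad, "no-delete-on-rename", forbidden_tool="delete_file")

# After a fix, rerun the same task and score the new trace with the same check.
fixed = Trace(task=bad.task, tool_calls=["read_file", "edit_file"])
print(case.name, case.check(bad), case.check(fixed))
```

The eval keeps the original task around, so the same case can be rerun every time the prompt, model, or skill set changes.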
Skill Learning is ONE great Way to Codify Trace Learnings into Context for your Agent
“skillify”/Skill Learning is great, agreed!! (see our /remember YouTube video below and blogs on hill climbing coding agents). Love that Garry’s discovering Skill Learning from Traces as a mechanism for fixing agent mistakes. Skills are semantic bundlers: they package everything needed to accomplish a goal, like instructions and code, in one folder. This reduces the search needed to aggregate information across sources. Skills also have context engineering built in through progressive disclosure, which helps many users.
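Here's a toy version of progressive disclosure, assuming a skill is just a name, a short description, and a longer instruction body. The `Skill` class and both helper functions are illustrative, not any particular framework's format: only the one-line descriptions sit in context up front, and the full body gets loaded once a skill is actually picked.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # short, always visible to the agent
    instructions: str  # full body, loaded only when the skill is chosen


def skill_index(skills):
    """What goes in the context window up front: one line per skill."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def load_skill(skills, name):
    """Second stage: pull in the full instructions for a single skill."""
    by_name = {s.name: s for s in skills}
    return by_name[name].instructions


skills = [
    Skill("release-notes", "Draft release notes from merged PRs",
          "Step 1: list merged PRs. Step 2: group by label. Step 3: write a summary."),
    Skill("db-migration", "Write and verify a schema migration",
          "Step 1: diff the schema. Step 2: write up/down scripts. Step 3: dry run."),
]

index = skill_index(skills)
print(index)
print(load_skill(skills, "db-migration"))
```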
Skills are great, I love them and we use them heavily, but just a note that there are other approaches you can use to fix errors in production trace data. We discuss them briefly below!
Things to think about more deeply:
Context Engineering Still Matters even with Skills & Resolvers
We still need good context engineering! If you bloat your context window with TONS of skills that are hard for an agent to disambiguate, then the “Resolver” mechanism will suck and you’re back in context-rot world. “Resolvers” are classifiers of intent, so you need to protect your context window and make sure the “rules” in the table stay self-consistent over time without massively bloating context.
Good context engineering is often a search problem! We need to find the right context and pass it into the computation boundary —> the context window. The better we do that without confusing the agent, the better our results.
Maybe that looks like Skill Search?! Maybe similar skills should get merged or subagents should actually spend more compute doing proper skill research and disambiguation. If we use Skills as the primary agent update mechanism, then we need to think about how this works with context as we use agents across month and year timescales.
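To make "Skill Search" and "merge similar skills" a bit more concrete, here's a small sketch. It uses plain word-overlap (Jaccard) similarity as a stand-in for whatever retrieval you'd really use (embeddings, a reranker, a subagent), and the skill names, descriptions, and the 0.6 threshold are all invented for the example.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical skill descriptions
skills = {
    "pdf-extract": "extract text and tables from pdf files",
    "pdf-parse": "extract text from pdf files",
    "git-rebase": "rebase a feature branch onto main",
}


def search_skills(query, skills, k=1):
    """Only the top-k matching skills get passed into the context window."""
    ranked = sorted(skills, key=lambda name: jaccard(query, skills[name]), reverse=True)
    return ranked[:k]


def merge_candidates(skills, threshold=0.6):
    """Flag pairs of skills similar enough that the agent may confuse them."""
    names = sorted(skills)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if jaccard(skills[a], skills[b]) >= threshold
    ]


print(search_skills("rebase my branch onto main", skills))
print(merge_candidates(skills))
```

Running the merge check periodically is one cheap way to keep a skill library from drifting into overlapping, hard-to-disambiguate entries over months of use.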
Building in Higher-Level Primitives
I love Skill Learning but it’s often a whack-a-mole solution if not managed properly. For example, if you wanted to build an ultra-long horizon coding agent (think Factory Missions or something on Frontier-SWE), then you need to think through the harness architecture and work backwards from the goal: how to recursively use subagents & planning, or how to manage & share context in a filesystem. Traces often help you uncover local issues and skills help you solve those, but it’s very important today to think about agent architecture and working backwards from big problems to avoid the potential local minima of Skill Learning. It’s tbd how much compute you need to uncover good agent architecture primitives for very hard problems. Skill Learning to fix scoped problems is great in the meantime and maybe can get us much further with smarter models.
Evals Alongside/Beyond LLM as a Judge
The hardest part of all this is by far figuring out what actually went wrong across Traces at scale and testing whether the proposed fix works over time! Does it work across models? Does it continue to work if you change something else in the system prompt or add another skill? Evals codify the failure case into a check that can be detected in real time (Online Evals/Monitoring). We need to test this stuff, which is why I like the LLM as a Judge approach that Garry mentions, but there’s much more we can do (programmatic evals, multi-turn cases, containerizing the eval environment to faithfully reproduce what went wrong). Great start, happy to help extend it to make your agents better :)
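A tiny sketch of a programmatic eval run across configurations, to show the "does the fix still hold if I change the model or add a skill?" question in code. `fake_agent` is a stand-in with invented behavior (not a real agent or model); the part worth copying is the deterministic check and the config matrix.

```python
from itertools import product


def fake_agent(model, extra_skill, task):
    """Stand-in for a real agent run. Invented behavior: the 'small' model
    forgets to run tests once an extra skill crowds its context."""
    steps = ["edit_file"]
    if not (model == "small" and extra_skill):
        steps.append("run_tests")
    return steps


def ran_tests(steps):
    """Programmatic eval: deterministic, no judge model needed."""
    return "run_tests" in steps


# Run the same eval over every (model, extra_skill) combination.
results = {
    (model, extra_skill): ran_tests(fake_agent(model, extra_skill, "fix the failing build"))
    for model, extra_skill in product(["small", "large"], [False, True])
}

regressions = [config for config, passed in results.items() if not passed]
print(regressions)
```

Checks like this are cheap enough to run on every change, and an LLM judge can sit alongside them for the fuzzier behaviors a simple assertion can't capture.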
Could write on this for days but I promise you, we’re thinking SUPER hard about primitives for self-improving agents, mining data from Traces, agent-first tooling that makes this possible, and basically any way we can help builders create the best agents in the world.
We have a lot coming soon, reach out if I can help, let’s cook 🚀