On evals: I keep getting messages asking “ok so how do I actually start learning this?”
There is no better way than just doing it, so copy this into Claude Code and get started today:
<instructions>
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right
3. Now let’s actually run a Task using the built-in Claude Code integration. It’s just a flag
4. Once that’s done, let’s go through the ATIF file it produced together. Help me understand what just happened. Did we pass the task? If not, can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok, let’s try to improve our agent by adjusting the prompt, then rerun on a few tasks. Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together
…
</instructions>
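For step 5, one rough way to answer "is this helping?" is to compare per-task results before and after the prompt change, not just the overall pass rate. Here is a minimal sketch, assuming you have already noted pass/fail per task by hand into plain dicts. The task names and results are made up for illustration, and nothing here uses Harbor's actual API or output format.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[float, List[str]]]


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def compare_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Report:
    """Compare two runs on the tasks they have in common."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "baseline_rate": pass_rate({t: baseline[t] for t in shared}),
        "candidate_rate": pass_rate({t: candidate[t] for t in shared}),
        # Tasks that failed before the prompt change and pass now.
        "fixed": [t for t in shared if not baseline[t] and candidate[t]],
        # Tasks that passed before the prompt change and fail now.
        "broken": [t for t in shared if baseline[t] and not candidate[t]],
    }


# Hypothetical results: True means the verifier passed the task.
baseline = {"task-a": True, "task-b": False, "task-c": False, "task-d": True}
candidate = {"task-a": True, "task-b": True, "task-c": False, "task-d": False}

report = compare_runs(baseline, candidate)
print(report)
```

In this made-up example the pass rate is the same for both runs, but one task got fixed and another broke. With only a few tasks the headline number is noisy, so the fixed and broken lists tell you which traces to go read next.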
Spend a few days actually running evals, reading and understanding a bunch of traces, internalizing agent failure modes, and staying super in the loop on what the agent sees and does
Have fun! Evals are super important, and they don’t have to be scary. DM me if I can help, or just tweet out what you’re doing. Someone will help, I promise. We’re all learning