Joined January 2026
9 Photos and videos
Save money running Harbor rollouts ‼️ Sometimes cost is more important than reliability or reproducibility when running rollouts (e.g. during rapid iteration). Now in Harbor you can configure resource enforcement policies to save money.
8
1,768
🚨 stop zipping job results 🚨 ... upload results to Harbor Hub instead The hub makes it easy to share results with team members, customers, or simply save for later in a centralized place. Example of a TB2.1 job in 🧵
1
2
18
4,557
come hang out at CAIS!
the harbor community will be @ CAIS - come say hi!
9am Tue @ RLEval workshop Harbor & Terminal-Bench 3.0 talk by @alexgshaw / me
10:30am Tue @ RLEval workshop OpenThoughts-Agent talk by @AlexGDimakis
4pm Tue @ Agent Software Engineering workshop Harbor Adapters & Harbor Index talk by @LinShi592021
9am Wed: Keynote by @andykonwinski
2
10
1,392
healthcare benchmark, built on harbor!
1/🧵 Can AI agents automate U.S. healthcare workflows end to end given just clinician & insurer apps and operations, medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28% #AIinHealthcare 👇
2
2
14
2,078
Harbor Framework retweeted
May 21
On Evals - getting messages on “ok so how do I actually start learning this?” there is no better way than by just doing so you can copy this to Claude Code and get started today
<instructions>
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right?
3. Now let’s actually run a Task using the built in Claude Code integration, it’s just a flag
4. Once that’s done let’s read the ATIF file that was produced together and help me understand what just happened. Did we pass the task? If not can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok let’s try to improve our agent by adjusting the prompt. And let’s rerun on a few tasks? Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together …
</instructions>
Spend a few days reading a bunch of traces, actually running evals, understanding traces, internalizing agent failure modes, and being super in the loop of what the agent sees and does
Have fun! Evals are super important, they don’t have to be scary. DM if I can help or just tweet out what you’re doing, someone will help I promise, we’re all learning
23
29
333
21,157
Harbor Framework retweeted
This largely eliminates the reward hacks we found using BenchJack and makes benchmarks much more reliable. Great work!
We're releasing support for running verification in a separate sandbox. Tasks pre-configure artifacts to move from the agent sandbox into the verifier sandbox for the grading phase, improving the security boundary between agent and verifier. Blog post below. Happy building!
1
4
895
Harbor Framework retweeted
ha! yes absolutely @harborframework is a really powerful way to build and run a suite of evals for agents. harbor lets you define a dataset of tasks. each task: - defines the execution env (dockerfile/compose) - the prompt (instructions.md) - the verifier (deterministic, LLM-judge/etc) then run it against a cartesian multiple of: - agent (off the shelf claude/codex/customized - just impl a simple python class) - model - arbitrary args use -n to repeat enough to get stat-sig, -k to control concurrency (definitely use a cloud sandbox provider like islo.dev/rl to run 100s of trials in parallel, FD- i consult for islo)
1
1
2
397
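The run matrix described in the post above (tasks crossed with agents, models, and repeated attempts) can be sketched in plain Python. This is only an illustration under stated assumptions, not Harbor's API: `Trial`, `build_trials`, `pass_rate`, and every task, agent, and model name below are hypothetical, and `repeats` simply stands in for whichever flag controls the number of attempts.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One (task, agent, model, attempt) combination. Hypothetical type, not part of Harbor."""
    task: str
    agent: str
    model: str
    attempt: int


def build_trials(tasks, agents, models, repeats):
    """Expand the Cartesian product tasks x agents x models, repeated `repeats` times."""
    return [
        Trial(task, agent, model, attempt)
        for task, agent, model in product(tasks, agents, models)
        for attempt in range(repeats)
    ]


def pass_rate(results):
    """Fraction of trials that passed; `results` maps Trial -> bool."""
    return sum(results.values()) / len(results) if results else 0.0


trials = build_trials(
    tasks=["task-a", "task-b"],       # placeholder task names
    agents=["agent-1", "agent-2"],    # placeholder agent names
    models=["model-x"],               # placeholder model name
    repeats=3,                        # attempts per combination
)
```

Repeating each combination is what makes a per-agent pass rate meaningful: with a single attempt per task, run-to-run variance is easy to mistake for a real difference between agents.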
Harbor Framework retweeted
FrontierCS now in @harborframework
We integrated FrontierCS into Harbor and are releasing a preview long-horizon agent leaderboard (up to 835 turns, ~200K output tokens) with Kimi K2.6 @Kimi_Moonshot (score 46.9) and Claude Code Opus 4.7 @claudeai (43.0) 🚢. The goal: evaluate frontier coding agents in a setting where they iteratively write code, run experiments, read feedback, and improve in an extremely long loop. FrontierCS tasks are open-ended optimization problems. Each task has a continuous score. There is no single accepted output. Agents need to search for better solutions under a step/time/token budget. This makes FrontierCS a natural fit for agentic evaluation. Just plan, code, test, revise, fail, recover, and keep optimizing. Check out our blog: frontier-cs.org/blog/harbor FrontierCS GitHub: github.com/FrontierCS/Fronti…
1
4
32
3,471
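The "continuous score under a budget" setup in the post above can be illustrated with a toy loop. This is a minimal sketch, not FrontierCS or Harbor code: `search_under_budget`, `toy_score`, and `toy_propose` are hypothetical names, and a simple keep-the-best random search stands in for whatever an actual agent does between revisions.

```python
import random


def search_under_budget(score_fn, initial, propose, max_steps, seed=0):
    """Keep the best-scoring candidate seen within a fixed step budget."""
    rng = random.Random(seed)
    best, best_score = initial, score_fn(initial)
    history = [best_score]  # best score after each step, for later inspection
    for _ in range(max_steps):
        candidate = propose(best, rng)   # "revise": derive a new attempt from the current best
        score = score_fn(candidate)      # "test": continuous feedback, no single accepted output
        if score > best_score:           # keep improvements, discard regressions
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


def toy_score(x):
    """Toy objective (an assumption for the demo): higher is better, peak at x = 3."""
    return -(x - 3.0) ** 2


def toy_propose(x, rng):
    """Toy revision step: perturb the current best by a small random amount."""
    return x + rng.uniform(-1.0, 1.0)


best, best_score, history = search_under_budget(toy_score, 0.0, toy_propose, max_steps=200)
```

Because the score is continuous, the interesting output is the whole `history` curve rather than a single pass/fail bit: it shows how quickly a given budget is converted into improvement.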
We built Harbor to evaluate agents. But why limit ourselves to just agents? Today we're adding first-class support for evaluating skills, MCPs, prompts, and services. Ablate your agents.
2
42
5,626
Separating the agent sandbox and verifier sandbox now supported in harbor! harborframework.com/docs/tas… Nice writeup below from harbor community member @rishi_desai2 on why this is an important design decision to prevent reward hacking.
Reward hacking is an arms race between coding agents and RL envs. A common eval flaw: the agent and verifier share the same sandbox. If the agent can tamper with the grader, “pass” may just mean “cheated.”
1
17
2,230
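The idea behind the separate verifier sandbox can be shown with two temporary directories standing in for the two sandboxes. This is a rough sketch under stated assumptions, not Harbor's implementation: `move_artifacts`, `grade`, and the file names are all hypothetical, and real sandboxes would be isolated containers rather than local folders.

```python
import shutil
import tempfile
from pathlib import Path


def move_artifacts(agent_dir: Path, verifier_dir: Path, declared: list[str]) -> list[str]:
    """Copy only the pre-declared artifact paths from the agent sandbox to the verifier sandbox."""
    copied = []
    for rel in declared:
        src = agent_dir / rel
        if src.is_file():
            dst = verifier_dir / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel)
    return copied


def grade(verifier_dir: Path) -> bool:
    """Toy deterministic check (an assumption for the demo): answer.txt must contain '42'."""
    answer = verifier_dir / "answer.txt"
    return answer.is_file() and answer.read_text().strip() == "42"


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as v:
    agent_dir, verifier_dir = Path(a), Path(v)
    (agent_dir / "answer.txt").write_text("42\n")
    # A file the agent planted hoping to influence grading; it is not declared, so it never moves.
    (agent_dir / "grader_override.py").write_text("PASS = True\n")

    copied = move_artifacts(agent_dir, verifier_dir, declared=["answer.txt"])
    passed = grade(verifier_dir)
    leaked = (verifier_dir / "grader_override.py").exists()
```

The list of artifacts is fixed by the task ahead of time, so the agent controls the contents of what crosses the boundary but not which files cross it, and the grading logic itself never lives where the agent can write.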
Harbor Framework retweeted
Evaluate biomedical agents using @harborframework . Congrats to the @phylo_bio team on a great benchmark!
๐—–๐—ฎ๐—ป ๐—”๐—œ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ ๐—ฏ๐—ถ๐—ผ๐—บ๐—ฒ๐—ฑ๐—ถ๐—ฐ๐—ฎ๐—น ๐—ฑ๐—ฎ๐˜๐—ฎ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€ ๐˜๐—ฎ๐˜€๐—ธ๐˜€ ๐—ฏ๐—ฒ๐—ต๐—ถ๐—ป๐—ฑ ๐—ฝ๐—ฎ๐—ฝ๐—ฒ๐—ฟ๐˜€ ๐—ถ๐—ป ๐—ก๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ, ๐—–๐—ฒ๐—น๐—น, ๐—ฎ๐—ป๐—ฑ ๐—ฆ๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ? To find out, we built ๐—•๐—ถ๐—ผ๐—บ๐—ป๐—ถ๐—•๐—ฒ๐—ป๐—ฐ๐—ต, a benchmark we co-developed with the original paper authors and 5 year domain experts to grade AI agents the way a peer reviewer reads a paper: scrutinizing methods, reasoning, and every analytical choice, not just the final answer. As the first track of this benchmark, ๐—•๐—ถ๐—ผ๐—บ๐—ป๐—ถ๐—•๐—ฒ๐—ป๐—ฐ๐—ต-๐——๐—ฎ๐˜๐—ฎ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€ contains 100 data-analysis tasks drawn directly from 21 published studies in Nature, Cell, Science, Nature Medicine, and other leading journals. Each task hands the agent a real dataset and a research question, then scores its full analytical trajectory against an expert-authored rubric. 
What's inside: - ๐Ÿญ๐Ÿฌ๐Ÿฌ ๐˜๐—ฎ๐˜€๐—ธ๐˜€ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐Ÿฑ ๐—ฑ๐—ถ๐˜€๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ฎ๐—ฟ๐—ฒ๐—ฎ๐˜€ (๐—ผ๐—ป๐—ฐ๐—ผ๐—น๐—ผ๐—ด๐˜†, ๐—ถ๐—บ๐—บ๐˜‚๐—ป๐—ผ๐—น๐—ผ๐—ด๐˜†, ๐—ป๐—ฒ๐˜‚๐—ฟ๐—ผ๐—น๐—ผ๐—ด๐˜†, ๐—บ๐—ฒ๐˜๐—ฎ๐—ฏ๐—ผ๐—น๐—ถ๐—ฐ & ๐—ฒ๐—ป๐—ฑ๐—ผ๐—ฐ๐—ฟ๐—ถ๐—ป๐—ฒ, ๐—ฐ๐—ฎ๐—ฟ๐—ฑ๐—ถ๐—ผ๐˜ƒ๐—ฎ๐˜€๐—ฐ๐˜‚๐—น๐—ฎ๐—ฟ) ๐—ฝ๐—น๐˜‚๐˜€ ๐—ด๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐—น ๐—ฏ๐—ถ๐—ผ๐—น๐—ผ๐—ด๐˜† - ๐Ÿญ๐Ÿณ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐—ฎ๐—น ๐˜๐—ฎ๐˜€๐—ธ ๐˜๐˜†๐—ฝ๐—ฒ๐˜€ (๐—ฒ.๐—ด., ๐—š๐—ช๐—”๐—ฆ/๐—ฒ๐—ค๐—ง๐—Ÿ ๐—ฐ๐—ผ๐—น๐—ผ๐—ฐ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป, ๐—ง-๐—ฐ๐—ฒ๐—น๐—น ๐—ฟ๐—ฒ๐—ฐ๐—ฒ๐—ฝ๐˜๐—ผ๐—ฟ ๐—ฟ๐—ฒ๐—ฝ๐—ฒ๐—ฟ๐˜๐—ผ๐—ถ๐—ฟ๐—ฒ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€, ๐—ฐ๐—ฒ๐—น๐—น-๐—ฐ๐—ฒ๐—น๐—น ๐—ฐ๐—ผ๐—บ๐—บ๐˜‚๐—ป๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป) - ๐—”๐—ป ๐—ฒ๐˜…๐—ฝ๐—ฒ๐—ฟ๐˜-๐—ฐ๐˜‚๐—ฟ๐—ฎ๐˜๐—ฒ๐—ฑ ๐—ฟ๐˜‚๐—ฏ๐—ฟ๐—ถ๐—ฐ ๐—ณ๐—ผ๐—ฟ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐˜† ๐˜๐—ฎ๐˜€๐—ธ, ๐˜€๐—ฐ๐—ผ๐—ฟ๐—ถ๐—ป๐—ด ๐Ÿฒ ๐—ฑ๐—ถ๐—บ๐—ฒ๐—ป๐˜€๐—ถ๐—ผ๐—ป๐˜€ ๐—ผ๐—ณ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐—ฎ๐—น ๐—พ๐˜‚๐—ฎ๐—น๐—ถ๐˜๐˜† - ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€-๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ผ๐—ณ ๐Ÿต ๐—ณ๐—ฟ๐—ผ๐—ป๐˜๐—ถ๐—ฒ๐—ฟ ๐—Ÿ๐—Ÿ๐— ๐˜€ (๐—š๐—ฃ๐—ง-๐Ÿฑ.๐Ÿฑ, ๐—–๐—น๐—ฎ๐˜‚๐—ฑ๐—ฒ ๐—ข๐—ฝ๐˜‚๐˜€ ๐Ÿฐ.๐Ÿณ, ๐—ฎ๐—บ๐—ผ๐—ป๐—ด ๐—ผ๐˜๐—ต๐—ฒ๐—ฟ๐˜€) ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐Ÿฐ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐—ต๐—ฎ๐—ฟ๐—ป๐—ฒ๐˜€๐˜€๐—ฒ๐˜€ (๐—–๐—น๐—ฎ๐˜‚๐—ฑ๐—ฒ ๐—–๐—ผ๐—ฑ๐—ฒ, ๐—–๐—ผ๐—ฑ๐—ฒ๐˜… ๐—–๐—Ÿ๐—œ, ๐—ง๐—ฒ๐—ฟ๐—บ๐—ถ๐—ป๐˜‚๐˜€-๐Ÿฎ, ๐—š๐—ฒ๐—บ๐—ถ๐—ป๐—ถ ๐—–๐—Ÿ๐—œ) Headline results: - ๐—™๐—ฟ๐—ผ๐—ป๐˜๐—ถ๐—ฒ๐—ฟ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—น๐—ฒ๐—ฎ๐—ฑ ๐—ฎ๐˜ ๐Ÿณ๐Ÿฏ.๐Ÿฏ/๐Ÿญ๐Ÿฌ๐Ÿฌ, ๐˜„๐—ถ๐˜๐—ต ๐˜€๐˜‚๐—ฏ๐˜€๐˜๐—ฎ๐—ป๐˜๐—ถ๐—ฎ๐—น ๐—ต๐—ฒ๐—ฎ๐—ฑ๐—ฟ๐—ผ๐—ผ๐—บ ๐˜๐—ผ ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ. 
- ๐—ง๐—ต๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐—ต๐—ฎ๐—ฟ๐—ป๐—ฒ๐˜€๐˜€ ๐—บ๐—ฎ๐˜๐˜๐—ฒ๐—ฟ๐˜€ ๐—ฎ๐˜€ ๐—บ๐˜‚๐—ฐ๐—ต ๐—ฎ๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น. - ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—ณ๐—ฎ๐—น๐—น ๐˜€๐—ต๐—ผ๐—ฟ๐˜ ๐—ผ๐—ป ๐—ฏ๐—ถ๐—ผ๐—น๐—ผ๐—ด๐—ถ๐—ฐ๐—ฎ๐—น ๐—ถ๐—ป๐˜๐—ฒ๐—ฟ๐—ฝ๐—ฟ๐—ฒ๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป, ๐—บ๐—ฒ๐˜๐—ต๐—ผ๐—ฑ ๐˜€๐—ฒ๐—น๐—ฒ๐—ฐ๐˜๐—ถ๐—ผ๐—ป, ๐—ฎ๐—ป๐—ฑ ๐˜€๐—ฐ๐—ถ๐—ฒ๐—ป๐˜๐—ถ๐—ณ๐—ถ๐—ฐ ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ผ๐—ป๐—ถ๐—ป๐—ด. We hope to make ๐—•๐—ถ๐—ผ๐—บ๐—ป๐—ถ๐—•๐—ฒ๐—ป๐—ฐ๐—ต the most helpful benchmark for biologists to understand how AI agents handle real-world biomedical tasks: where they can be trusted, and where they fall short. We're actively expanding our evaluation effort, and would love to engage the broader scientific community on what comes next. ๐Ÿ“„ biorxiv.org/content/10.64898โ€ฆ ๐Ÿค— huggingface.co/datasets/phylโ€ฆ Thanks to our amazing @phylo_bio team (Minta Lu, @TuXinming , @serena2z , @TianweiShe , @lecong , @jure , @KexinHuang5 ) and our collaborators at @LaudeInstitute , @Stanford , @Harvard , @PKU1898 , @virginia_tech , Humanlaya Data Lab, Xbench: @alexgshaw , JOU-HO SHIH, Bingqing Zhao, Minjie Shen, Haochen Yang, Jielin Yan, Rongchuan Zhang, Xinze Wu, Tingting Li, Xiaobo Hu, Yuan Jiang, Jiayun Dong, Tao Peng.
4
25
3,157
Harbor Framework retweeted
Great write up by @adithya_s_k about @harborframework . I want to add some thoughts around coding agents = CUA and Harbor coding envs = computer envs. One of the reasons we built Terminal-Bench was because we saw that terminals/code were/was a powerful way for language models to control a computer. We’ve always viewed TB as a computer-use benchmark. Coding agents = CUA means measuring coding agents is essentially the same thing as measuring general purpose agents. This is becoming more obvious with products like Claude Cowork, which is essentially a non-technical interface around Claude Code, and OpenAI’s push to making Codex a more general purpose tool. We see this on the Harbor side too. Users create coding tasks. But they also create finance, law, accounting, engineering, general computer work, etc. tasks as well. Terminal-Bench 3.0 will cover all of these domains. The implication is that Harbor becomes a tool for representing and measuring agents’ abilities to perform arbitrary computer work, which right now is the exact scope that users build agents to automate. In fact, the Harbor Framework (as opposed to the Harbor Format) is just one opinionated way of performing rollouts on Harbor tasks. It works particularly well for agent evals. But there is no reason people can’t/shouldn’t implement other means of performing rollouts on Harbor tasks (e.g. @PrimeIntellect, @GenReasoning, and @tinkerapi all support some variation of a Harbor rollout). We’ll have some releases around this soon. To summarize, coding agents = CUA, Harbor’s coding environments = computer environments, which means the scope of Harbor is probably broader than you think (as our users will attest!)
3
8
111
12,895
Harbor Framework retweeted
As agents get more clever, so do their attempts at benchmark hacking. Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64% which would make it #1 on the leaderboard. This was clearly benchmark hacking and we patched the exploit. But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone. Evals need to evolve beyond just outcome based pass rates to better observability into how the agent is arriving at them. These were our findings: poolside.ai/blog/through-the… Examples below 👇 1/
8
23
107
17,201
Harbor Framework retweeted
Evals are specs for agents. Building agents <> Building evals with harbor
You don't need a new IDE. You need a new ISE. Integrated Spec Environment. Spec is the new code. Ship the right spec and your job is basically done.
5
1
19
1,566
Harbor Framework retweeted
Can’t imagine agent research without Harbor!
2
7
804