Joined May 2025
terminalbench retweeted
deadline to submit tasks for Terminal-Bench 3.0 is May 31st! the best tasks are the most interesting to measure: realistic, useful, and meaningfully beyond the current frontier. any piece of valuable work done on a computer is fair game
Contribute to Terminal-Bench Science!
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-science-an… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0.
TB2.1 includes:
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵. We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)
Thank you to @ekellbuch for leading TB2.1, @Zai_org for Terminal-Bench 2.0 Verified, which informed 11 of the 28 tasks we patched, and @SnorkelAI and @togethercompute for support
terminalbench retweeted
The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard. We're adding some new policies to keep it reliable:
• ATIF trajectories required for all passing trials
• Reward hacking results in reward 0 for the trial
• Cheating results in immediate leaderboard removal
Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences! Detailed blog post in comments ⬇️
terminalbench retweeted
We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard. Thank you @NoCommas for helping us keep leaderboard entries honest! Recent leaderboard submissions are in huggingface.co/datasets/harb… which makes it easy for the community to work together to detect cheating.