wolfbench.ai // @WolframRvnwlf's new evaluation framework for models and agents: because one score is not enough! // brought to you by @CoreWeave/@wandb

Joined March 2026
Pinned Tweet
We spent $11,081.12 evaluating @AnthropicAI's Claude Fable 5 on WolfBench. Our most expensive benchmark yet. And it did not even top the charts. Not because it lacked capability, but because it kept refusing. Details in thread: 🧵
7
11
57
13,272
Even without refusals, real agentic failure patterns remained. Fable's biggest weakness was overconfident self-verification: it often declared victory too early once the solution looked plausible, while the actual benchmark checks still caught wrong output, messy cleanup, missed edge cases, or slow code. Claude Fable 5 may be exceptional, but it is currently not the best fit as a general-purpose agentic daily driver. It is too expensive and too refusal-prone to turn its strengths into efficient, reliable agentic work.
1
4
456
Explore the full results at WolfBench.ai: compare models and agents in the interactive chart, and click any bar to open the corresponding evals and traces in @weave_wb for deeper inspection.

1
327
For benchmarks, I keep agent versions stable so results stay comparable. But new models can expose agent-side bugs. Here, updating @openclaw from 2026.3.11 to 2026.4.23 lifted Kimi K2.6 from 4% to 60% on @WolfBenchAI due to crucial fixes in how the agent handles its tool calling.
3
7
557
GPT-5.5 takes over WolfBench! It's now the #1 model, ahead of Claude Opus 4.7 and 4.6, GPT-5.4, Sonnet 4.6, Kimi K2.6, Gemini 3.1 Pro, and more.
Notable findings after 30 runs (40h runtime, >1.7B tokens, ~$3K cost):
- @OpenAI's GPT-5.5 is the best model we ever tested.
- @cursor_ai's Agent CLI (CA) is the best agent we ever tested.
- @NousResearch's Hermes Agent (HA) outperformed OpenClaw (OC).
- With Hermes, going from medium to xhigh reasoning only improved consistency, not capability.
Note: This is WolfBench, where we look at more than just the average score, because one metric is not enough. The golden ∅ score is the actual 5-run average, which most other benchmarks report as their only score. ★ shows the ceiling (what percentage of the full benchmark this model agent combination solved at least once across all runs). ■ shows the solid base (what percentage of the full benchmark it solved consistently in every run).
3
3
30
2,969
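The three scores described in the note above (∅ average, ★ ceiling, ■ solid base) can be sketched in a few lines of Python. This is only an illustration of the definitions as worded in the post, not WolfBench's actual implementation; the function name `wolfbench_scores`, the input layout (one dict per run mapping a task id to solved/not solved), and the toy data are all assumptions made for the example.

```python
from typing import Dict, List


def wolfbench_scores(runs: List[Dict[str, bool]]) -> Dict[str, float]:
    """Compute average, ceiling, and solid base from per-run task results.

    `runs` holds one dict per run, mapping a task id to True if that run
    solved the task. (Hypothetical layout, chosen for this sketch.)
    """
    tasks = sorted(runs[0])
    n_tasks = len(tasks)

    # Average: mean of the per-run solve rates (the usual single score).
    per_run = [sum(run[t] for t in tasks) / n_tasks for run in runs]
    average = sum(per_run) / len(runs)

    # Ceiling: share of tasks solved at least once across all runs.
    ceiling = sum(any(run[t] for run in runs) for t in tasks) / n_tasks

    # Solid base: share of tasks solved in every single run.
    base = sum(all(run[t] for run in runs) for t in tasks) / n_tasks

    return {"average": average, "ceiling": ceiling, "base": base}


# Toy data: 4 tasks, 3 runs (invented purely for illustration).
toy_runs = [
    {"a": True, "b": True, "c": False, "d": False},
    {"a": True, "b": False, "c": True, "d": False},
    {"a": True, "b": True, "c": False, "d": False},
]
scores = wolfbench_scores(toy_runs)
print(scores)
```

In the toy data every run solves half the tasks, so the average alone hides that three of four tasks were solved at least once while only one was solved every time. That spread is the point of reporting all three numbers.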
Let's compare the WolfBench top model, GPT-5.5, with our #2, Claude Opus 4.7:
- @openclaw is still better on Opus 4.7 than on GPT-5.5: 75% vs. 70%, with a slightly higher ceiling and base. This is also the third-highest score across all models and agents - only @cursor_ai and Terminus-2 (the official @terminalbench 2.0 test harness) rank higher, both at 77%.
- When no reasoning level is set, the OpenAI API defaults GPT-5.5 to medium, while the Anthropic API defaults Opus 4.7 to no thinking. That's why Terminus-2 and Hermes Agent have different effort levels.
- Note that higher effort levels don't necessarily improve scores in agentic benchmarks - thinking harder can actually make the model dumber: wandb.ai/wandb_fc/wolfbench-…
- Still have to evaluate Cursor with Opus 4.7; with 4.6, it got 63%.
4
278
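Because the two APIs fall back to different defaults when no reasoning level is set, a harness comparison is easier to read if each run is labeled with the effort level that was actually in effect. A minimal sketch of that bookkeeping follows. The default values come from the post above; the provider keys, the `"none"` label, and the helper name `effective_effort` are hypothetical and do not correspond to any real API parameter.

```python
from typing import Optional

# Provider defaults as stated in the post above; the keys and the "none"
# label are illustrative names for this sketch, not real API values.
PROVIDER_DEFAULT_EFFORT = {
    "openai": "medium",   # GPT-5.5 when no reasoning level is set
    "anthropic": "none",  # Opus 4.7 when no thinking is requested
}


def effective_effort(provider: str, requested: Optional[str] = None) -> str:
    """Return the effort level a run really used, for labeling results."""
    if requested is not None:
        return requested
    return PROVIDER_DEFAULT_EFFORT[provider]


# A harness that sets nothing inherits the provider default.
print(effective_effort("openai"))              # medium
print(effective_effort("anthropic"))           # none
print(effective_effort("anthropic", "xhigh"))  # xhigh
```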
Replying to @OpenAI
Visit wolfbench.ai for the full lineup of models and agents. The site is fully interactive: filter and sort by models, agents, metrics, and scores, or click any bar to jump straight to the corresponding @weave_wb evals and traces. Full transparency for all 300 runs!
1
3
239
WolfBench retweeted
The super interesting thing that not enough people are talking about is OpenClaw topping the T2 leaderboard for Opus 4.7 with thinking off (@WolfBenchAI eval harness). OC is also a generic harness, unlike the other harnesses in the leaderboard below, which are likely benchmaxxed for coding.
FYI Claude Code is mostly a vibe-coded product (as they say, 100% written by Claude). It's the worst harness for Opus 4.6 among ANY harness on Terminal-Bench 2.
3
1
3
451
WolfBench retweeted
That's fair. But this one is a bit different and tells a realistic story (my custom testing pipeline shares more than half of what it uses).
1
3
174
WolfBench retweeted
Not even ready for something to outperform Claude 🤯
Hermes Agent outperformed Claude Code and OpenClaw as an agentic harness for both Opus 4.6 and GPT-5.4 on 89 real-world tasks. Not just higher scores but a higher floor. More tasks solved reliably, every single run. @teknium @NousResearch really cooked with this one. 🔥
1
4
287
WolfBench retweeted
For the past few days the Hermes agent has kept catching my attention. Honestly, I thought OpenClaw would dominate the market a bit longer, but a strong competitor, though not yet proven, seems to have arrived. There is a team in the US called NousResearch. Nous Research is one of the leading startups/research teams in open-source AI, and it has drawn a lot of attention for "user-aligned" models that users can control directly. Their Hermes Agent beat Claude Code and OpenClaw in a test on 89 real-world tasks. It was not just the score that was higher; the "floor" was higher. That means it reliably completed more tasks every time. So why did that happen? The key is the harness. A harness is the framework that wraps an AI model. Even the same Opus 4.6 gives different results depending on which harness you put it in. Hermes's claim is not "our model is better." What matters is the claim that "we built a structure that uses the same model better." The core of that structure is a technique called the learning loop. Claude Code starts fresh every time. OpenClaw manages memory manually through MEMORY.md. Setting things up so that memory persists is still the human's job. Hermes works a little differently. When a complex task is finished, the agent autonomously creates and stores reusable skills. The agent itself decides what to remember and what to turn into a skill. In the words of the official description: "built-in learning loop", "autonomous skill creation", "skills self-improve during use." The agent keeps getting smarter even if the human does nothing. The community sometimes describes Hermes as "halfway between a Claude Code-style CLI and an OpenClaw-style messaging agent." In other words, it is trying to be both. In the terminal, on Telegram, and on a VPS.
After the v0.2.0 release it quickly passed 10,000 stars and has now broken 22,000. Personally, I also think this strategy is clever. People are drawn more to the story of "an agent that remembers me" than to "a model that is 3% smarter." Hermes's slogan, "The agent that grows with you", sells not performance but the relationship between me and the agent. The next battleground for AI agents is not model performance. Isn't it how fast an agent learns and how long it remembers? And it feels like personalized agent brands are getting closer and closer. I am going to install it and give it a run today myself.
Hermes Agent outperformed Claude Code and OpenClaw as an agentic harness for both Opus 4.6 and GPT-5.4 on 89 real-world tasks. Not just higher scores but a higher floor. More tasks solved reliably, every single run. @teknium @NousResearch really cooked with this one. 🔥
23
49
242
25,625
We just published our evals for Agent Harnesses on WolfBench and Hermes out of the box came out on top. x.com/WolfBenchAI/status/203…

Hermes Agent outperformed Claude Code and OpenClaw as an agentic harness for both Opus 4.6 and GPT-5.4 on 89 real-world tasks. Not just higher scores but a higher floor. More tasks solved reliably, every single run. @teknium @NousResearch really cooked with this one. 🔥
6
639
Hermes Agent outperformed Claude Code and OpenClaw as an agentic harness for both Opus 4.6 and GPT-5.4 on 89 real-world tasks. Not just higher scores but a higher floor. More tasks solved reliably, every single run. @teknium @NousResearch really cooked with this one. 🔥
8
20
173
93,862
Key takeaways from our latest eval:
> Hermes Agent (default settings) hits 64% avg on Opus 4.6 vs Claude Code's 63% and OpenClaw's 58%, but the solid base jumps from 45%/42% to 49%.
> On GPT-5.4 the gap is massive: 66% avg vs Claude Code's 48% and OpenClaw's 61%, solid base 47% vs 22%/45%.
> It takes GPT-5.4 with xhigh effort for OpenClaw (OC) to surpass Hermes Agent (HA) with default=medium effort.
Full breakdown here: wolfbench.ai
1
1
23
2,262
GPT 5.4 is not just more reliable now with the latest @openclaw version, it's the best model I've tested on @WolfBenchAI, surpassing even Opus 4.6 with just its default settings (low reasoning for GPT, adaptive reasoning for Opus). And with xhigh thinking, it goes even higher! 🚀
New @openclaw beta bits are up! With Hunter🏹 Alpha (1M context!) and Healer🩹 Alpha FREE stealth models from @OpenRouter. Also, GPT 5.4 and @Kimi_Moonshot Coding now are more reliable, and lots of fixes around ACP and message handling. github.com/openclaw/openclaw…
1
9
3,198
What we see here is not only GPT 5.4's average score rising from an abysmal 31% to the top score of 61% (thinking: low) or even 71% (thinking: xhigh), but also the solid base, the tasks it always solves, rising from 7% to 45% or even 52%. That means it's not only good on average, but solidly good and consistently reliable. Its ceiling rose to an astounding 85% on xhigh, so in theory it could solve almost all of the tasks. If you have the funds, this looks to be your best choice. But if you want to save some money, the default low thinking is still second only to this.
2
197
WolfBench retweeted
Replying to @wandb @zubinaysola
Another drop today, first announced on @thursdai_pod: we launched wolfbench.ai! x.com/WolfBenchAI/status/203…

Introducing WolfBench: @WolframRvnwlf's new evaluation framework for models and agents, brought to you by @wandb. Single score metrics don't adequately describe model performance and capabilities. Here's how the new WolfBench framework solves that problem: wandb.ai/wandb_fc/wolfbench-…
1
2
3
840