Zortos

Tweets

Pinned Tweet

Zortos

@Zortosdev

Jun 13

Introducing OutputBench.dev (inspired by @daradoescode's website for frontend). I wanted coding benchmarks that look less like leaderboard puzzles and more like real engineering work. Same prompt. Real agent harnesses. Public repos. Concrete issues. Tradeoffs you can inspect and argue with. outputbench.dev

6,152

Zortos

Zortos

@Zortosdev

OpenRouter launched Fusion as “Fable-level intelligence at half the price.” I tried it on OutputBench: a real coding implementation benchmark.
Task: Stripe webhook → SQLite → Discord.
Fusion placed 12th.
2m00s runtime
$2.30 cost
393k input tokens
7 implementation issues
The big one: the real Stripe SDK path crashes before persistence.
So yeah, maybe Fusion works well for research synthesis. But for this coding task, I’d rather use a normal model directly.

121

Zortos

Zortos

@Zortosdev

View more detailed information at outputbench.dev/runs/openrou…

OutputBench - Real-world AI coding benchmarks

Hands-on coding benchmarks for AI models and harnesses. Same prompt, real tools, token costs, quality scores, and concrete issues.

outputbench.dev

107

Zortos

Zortos

@Zortosdev

1/ I benchmarked @Kimi_Moonshot Kimi K2.7 Code across 3 agent harnesses on OutputBench.dev. Same model. Same prompt. Same implementation benchmark. Task: build a FastAPI Stripe webhook that verifies signatures, saves events to SQLite, and forwards clean Discord notifications. Same prompt. Same model. Very different code.

179
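The task in the 1/ post above (verify the Stripe signature, save the event to SQLite, forward a clean Discord notification) can be sketched roughly as follows. This is a minimal standard-library sketch, not any benchmarked model's output and not the OutputBench reference solution. It assumes Stripe's documented `Stripe-Signature` scheme (`t=<timestamp>,v1=<HMAC-SHA256 of "timestamp.payload">`). The function names, the `events` table layout, and the message wording are hypothetical, and the FastAPI route and the HTTP call to Discord are left out.

```python
import hashlib
import hmac
import json
import sqlite3
import time


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300, now=None) -> bool:
    """Check a Stripe-Signature header of the form 't=<ts>,v1=<hex>'."""
    parts = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k.strip() == "t"), None)
    candidates = [v for k, v in parts if k.strip() == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > tolerance:
        return False  # stale timestamp: reject to limit replay attacks
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode())
               for c in candidates)


def store_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Persist an event once; return False if the id was already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id TEXT PRIMARY KEY, type TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (id, type, payload) VALUES (?, ?, ?)",
        (event["id"], event["type"], json.dumps(event)),
    )
    conn.commit()
    return cur.rowcount == 1


def discord_message(event: dict) -> dict:
    """Build the JSON body for a Discord webhook (hypothetical wording)."""
    return {"content": f"Stripe event {event['type']} ({event['id']})"}
```

Verifying against the raw request bytes before parsing, and using the event id as the primary key so a redelivered webhook is not stored twice, are the kinds of details the issue counts on the benchmark are meant to surface.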

Zortos

Zortos

@Zortosdev

Embed of the site is wrong ;( it cached the wrong image

Zortos

Zortos

@Zortosdev

5/ OutputBench is not asking “which model writes the nicest answer?” It asks: Did it build the thing? Does it run? Are the tests meaningful? What bugs survived? How much did it cost? How long did it take? The model matters. The harness changes the implementation.

Zortos

Zortos

@Zortosdev

6/ See the whole picture at outputbench.dev

Zortos

Zortos

@Zortosdev

Jun 13

Why I built it: Most model comparisons feel too abstract. I wanted to see what models actually ship on the same real task, inside real tools, with inspectable code and honest tradeoffs. Not just: “did it pass?” But: would you trust it in production?

114

Zortos

Zortos

@Zortosdev

Jun 13

I’m adding more suites, models, and harnesses. What should OutputBench benchmark next? Drop a model, harness, or real-world coding task in the comments or DM me.

Zortos

Zortos

@Zortosdev

Jun 13

I am so happy that I did the test yesterday. The current benchmark chart looks like this for one of the suites. What models should I benchmark as well?

1,754

Dara A.

Zortos retweeted

Dara A.

@daradoescode

Jun 13

this sets a crazy precedent btw

1,677

Zortos

Zortos

@Zortosdev

Jun 11

Coming soon ™️, inspired by @daradoescode's website: a benchmark that tests the one-prompt output of each model and harness. It will rank each run by how fast it is and by how many issues the project has and how severe they are.

488
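The ranking idea in the post above (order runs by issue count and severity, then by speed) might look something like the sketch below. The severity labels, the weights, and the tie-break order are all hypothetical assumptions for illustration; the post does not state OutputBench's actual scoring formula.

```python
from dataclasses import dataclass, field

# Hypothetical severity weights; the real benchmark's weights are not given.
SEVERITY_WEIGHTS = {"critical": 8, "major": 4, "minor": 1}


@dataclass
class Run:
    name: str
    runtime_seconds: float
    issues: list = field(default_factory=list)  # one severity label per issue

    def penalty(self) -> int:
        """Sum the severity weights of every issue found in this run."""
        return sum(SEVERITY_WEIGHTS[severity] for severity in self.issues)


def rank(runs):
    """Lowest issue penalty first; faster runtime breaks ties."""
    return sorted(runs, key=lambda run: (run.penalty(), run.runtime_seconds))
```

Under these assumed weights, a fast run with one critical issue ranks below a slower run with two minor ones, which matches the "fastest ≠ best implementation" point made elsewhere in the feed.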

Zortos

Zortos

@Zortosdev

Jun 10

I benchmarked 3 coding models on the same coding task. Same prompt. Same plan. Same requirements.
Results:
🥇 @cursor_ai Composer 2.5 - 1:29
🥈 @claudeai Fable 5 Medium - 1:52
🥉 @OpenAI GPT-5.5 Medium Fast - 3:05
But fastest ≠ best implementation. Raw prompt breakdown below.

139

37,972

Zortos

Zortos

@Zortosdev

Jun 10

wasted all my usage to make this test🙏

2,100