Daniel Shepard

Tweets

Pinned Tweet

Daniel Shepard @danielwshepard

Apr 20

For the last few months I have been working on a new benchmark. Introducing AutomationBench, which tries to measure the cutting edge of models' capabilities in real-world business workflows across multiple apps and noisy data. The best models haven't beaten 10% yet.

Daniel Shepard retweeted

Mike Knoop

@mikeknoop

Jun 9

Zapier AutomationBench being used to report Tool Use performance on Fable 5's model card

Claude

@claudeai

Jun 9

Replying to @claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision. The longer and more complex the task, the larger Fable 5’s lead over our other models.

Benchmark table titled Mythos 5 & Fable 5, comparing Claude Mythos 5 and Fable 5 against Claude Mythos Preview, Claude Opus 4.8, GPT 5.5, and Gemini 3.1 Pro.

Daniel Shepard @danielwshepard

Jun 9

Fable 5 seems better than Opus in every way. Like Opus is to Sonnet. It works smarter rather than harder. Cost is 2x Opus but cost per task was only 17% more on max reasoning! Fable is much more efficient with tokens than other models. ~1/2 the cost of GPT 5.5 xhigh.

Wade Foster

@wadefoster

Jun 9

Claude Fable is here: the first model in their new Mythos series. It's the new top score on @Zapier's AutomationBench at 17.4%, just two weeks after Opus 4.8 set the record at 15.5%.

Our AutomationBench measures what enterprises actually care about: can a model do the work? Find the right CRM record, send the right follow-up, update the right system without breaking anything?

We tested 600 tasks across 6 domains. Here’s what we saw: Fable knows when to work smarter instead of harder. That means fewer timeouts and fewer wasted tokens in production.

EXAMPLE: One task asked the model to reconcile employee benefits across countries. The HR system's benefit-plans endpoint returned a 404. Fable hit it once, immediately pivoted to the team's spreadsheet and inbox, found the plan data there, and finished the task. Meanwhile, Opus moved on and missed a key detail.

That's the Fable pattern. It follows complex instructions precisely (especially the "leave these ones alone" kind), and when it hits a dead end, it goes looking somewhere else instead of spinning its wheels and wasting tokens.

PRICING: You may have seen that Fable is 2x the price of Opus. But that's the model rate, not the task cost. In Zapier, Fable came in at $3.67 per task at max effort, only 17% more than Opus 4.8 max at $3.14.

tl;dr: Who should immediately upgrade their workflows from @claudeai's Opus to Fable?
- Operations & HR
- Long Horizon Tasks needing reliability and autonomy
- Any workflows where precision and accuracy matter more than cost
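The "model rate, not task cost" point in the tweet above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes per-task cost is roughly tokens used times the per-token rate, and that the 2x rate applies uniformly to all tokens. The function name `relative_token_usage` is made up for this example.

```python
# Figures quoted in the tweet above (USD per task at max effort).
FABLE_COST_PER_TASK = 3.67
OPUS_COST_PER_TASK = 3.14
RATE_MULTIPLIER = 2.0  # Fable's model rate relative to Opus, per the tweet


def relative_token_usage(new_cost, old_cost, rate_multiplier):
    """Implied token usage of the new model relative to the old one.

    Assumes cost_per_task = tokens * rate, so
    tokens_new / tokens_old = (new_cost / old_cost) / rate_multiplier.
    """
    return (new_cost / old_cost) / rate_multiplier


cost_ratio = FABLE_COST_PER_TASK / OPUS_COST_PER_TASK
usage = relative_token_usage(FABLE_COST_PER_TASK, OPUS_COST_PER_TASK, RATE_MULTIPLIER)

print(f"Per-task cost increase: {(cost_ratio - 1) * 100:.0f}%")
print(f"Implied token usage relative to Opus: {usage * 100:.0f}%")
```

Under those assumptions, a 2x rate with only a 17% higher task cost implies Fable used a bit under 60% of the tokens Opus did on the same tasks.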

Daniel Shepard @danielwshepard

Jun 9

From Fable's system and model cards:

Daniel Shepard @danielwshepard

Jun 9

Our CEO Wade talks AutomationBench with examples.

Andrew Warner

@AndrewWarner

Jun 9

🚨 Anthropic released Claude Fable 5. It's Mythos, but safe.

The BIG question: Is it dependable enough to use apps to grow your business?

@wadefoster's team at @zapier ran it through 600 real-world business uses. Key results:
1. It stays on track - if you ask it about a specific topic in a specific Slack channel, it won't merge data in from other channels and topics.
2. It's the most resourceful - They told it to get HR data from an API that was down. It quickly switched from using the failed API to searching email & spreadsheets. (GPT 5.5 hit the down API 22 times!)
3. It routes intelligently - They asked it to take leads from multiple sources and send each to the right salesperson. It kills at operational tasks like that.

BUT:
1. For sales and marketing tasks, GPT 5.5 is still more dependable.
2. Fable is crazy expensive ($3.67/task vs $0.87 for Gemini 3.5 Flash)

If you love numbers (like me) the AutomationBench leaderboard is below.

Daniel Shepard @danielwshepard

Jun 9

Fable 5 is out. AutomationBench made the model card! (under Tool use)

Claude

@claudeai

Jun 9

Replying to @claudeai

Benchmark table titled Mythos 5 & Fable 5, comparing Claude Mythos 5 and Fable 5 against Claude Mythos Preview, Claude Opus 4.8, GPT 5.5, and Gemini 3.1 Pro.

Daniel Shepard @danielwshepard

Jun 2

This is a great talk on benchmarks! A good overview, some popular benchmarks, all the variables that can change results, and things to watch out for.

Florian Brand

@xeophon

Jun 1

The talk is now on YouTube! Link: youtube.com/watch?v=kmTMc-fV…

Daniel Shepard @danielwshepard

May 29

Talked with Andrew Warner about what AutomationBench measures and about Opus 4.8!

Andrew Warner

@AndrewWarner

May 29

Opus 4.8 is doing what 4.7 refused to do. 4.7 refused tasks related to:
• diversity hiring
• finance
• paychecks
Said "too risky."

@zapier tests every model by asking it to do a set of tasks and sees how many it gets right. I asked the guy who runs their benchmark work to teach me what each model can do and where they fail.

4.8 does the most multi-task work well, but it's not the winner for every task.

Daniel Shepard retweeted

AMC @TweetAnnaMarie

May 28

AutomationBench tests how models perform on the trickiest, stickiest real-world workflows we know customers are actually trying to automate. 600 tasks, 6 domains, deterministic scoring. And today our scores are featured on @AnthropicAI's official launch scorecard.

Daniel Shepard retweeted

Lisan al Gaib

@scaling01

May 28

Opus 4.8 ranks #1 on AutomationBench. AutomationBench measures whether an agent can complete a realistic end-to-end business workflow.

Daniel Shepard retweeted

Zapier

@zapier

May 28

Opus 4.8, the first model to break 15% on AutomationBench, is now live in Zapier! It handles complex HR, Finance, and multi-app workflows better than anything else we've tested: refusals dropped from 20% to 4%. Opus 4.7 would see a sensitive task and stop, but 4.8 keeps going.

Daniel Shepard retweeted

Logan Kilpatrick

@OfficialLoganK

May 21

Gemini 3.5 Flash ranks #1 on AutomationBench (from Zapier), beating every other frontier model at a much lower cost.

Daniel Shepard retweeted

Wade Foster

@wadefoster

May 20

Gemini 3.5 Flash just dropped. It’s the highest AutomationBench score yet.

We benchmark every major model on real workflows across Sales, Marketing, Ops, Support, Finance, and HR, so you know what works best inside Zapier.

Today, @GeminiApp 3.5 Flash set a new record. It crushed Operations (20%) and HR (19%), the domains with the most step coordination and strict policy adherence. It even scored higher than GPT 5.5 at xhigh effort (and at a fraction of the cost).

Where it struggled: strict output formats, and making decisions based on math it has to do on its own.

tl;dr: This is the most persistent model we've tested yet. And at $1.50 per million input tokens, this is built for cost-effective workflows. Try it in @Zapier now.

Daniel Shepard retweeted

Wade Foster

@wadefoster

May 15

We’re on the hunt to find the best small AI model. Here's the latest AutomationBench scorecard 👇 (Based on 2-step workflows with explicit instructions)

Most automation in production today doesn't run on the biggest model available. It runs on whatever hits the cost-performance ratio that works for the workflow. That's why we benchmark the small models, not just frontier.

The price gap is massive. Opus 4.7 Max is $1.80 per task vs Haiku's $0.0183. GPT-5.5 High? $6.31 per task vs. 5.4 Nano's $0.0035. 1800x difference.

One to watch: Gemini 3.1 Flash Lite came out of preview last week. Performs almost as well as 5.4 with no reasoning, at roughly the price of nano.

AutomationBench is free and open-source: github.com/zapier/Automation…
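The price gaps quoted in the tweet above are simple ratios of per-task costs. A minimal sketch that reproduces them, using only the four figures from the tweet (the dictionary keys and the helper name `price_ratio` are illustrative):

```python
# Per-task costs in USD, copied from the tweet above.
cost_per_task = {
    "Opus 4.7 Max": 1.80,
    "Haiku": 0.0183,
    "GPT-5.5 High": 6.31,
    "5.4 Nano": 0.0035,
}


def price_ratio(expensive, cheap):
    """How many times more the expensive model costs per task."""
    return cost_per_task[expensive] / cost_per_task[cheap]


opus_vs_haiku = price_ratio("Opus 4.7 Max", "Haiku")
gpt_vs_nano = price_ratio("GPT-5.5 High", "5.4 Nano")

print(f"Opus 4.7 Max vs Haiku: about {opus_vs_haiku:.0f}x")
print(f"GPT-5.5 High vs 5.4 Nano: about {gpt_vs_nano:.0f}x")
```

The second ratio is where the "1800x" figure comes from; the Opus-to-Haiku gap works out to roughly two orders of magnitude.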

Daniel Shepard retweeted

Zapier

@zapier

Apr 23

GPT-5.5 just hit 12.9% on our AutomationBench leaderboard. First model to break 10%. When context is missing, most models stop. GPT-5.5 keeps checking emails, docs, and chats until it knows what to do.

Daniel Shepard @danielwshepard

Apr 20

Leaderboard: zapier.com/benchmarks
GitHub: github.com/zapier/Automation…
Prime Intellect Environment: app.primeintellect.ai/dashbo…
White Paper: res.cloudinary.com/zapier-me…

AutomationBench: AI Agent Benchmarks | Zapier

Zapier's AI benchmark assessment scores LLMs on end-to-end workflow execution across real tools and business areas. Deterministic grading. No vibes.

zapier.com

Daniel Shepard @danielwshepard

Apr 22

The white paper is up on arxiv now: arxiv.org/abs/2604.18934

AutomationBench

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single...

arxiv.org

Daniel Shepard retweeted

Robin Salimans

@SalimansRobin

Apr 21

It’s been super fun working on this with @danielwshepard! It’s truly a great benchmark. Also thanks @mikeknoop for helping us figure things out along the way and @PrimeIntellect for the great verifiers framework Lab 🤝

Wade Foster

@wadefoster

Apr 20

We built an AI benchmark that measures real work. Today we're releasing it to everyone.

AI evals tell you whether a model can do complex reasoning or generate code. Useful, but usually not the question our customers ask. They want to know: can this model find the right CRM record, send the right follow-up, and not break anything along the way?

We went looking for a benchmark that tested that. Nobody had built one, so we did.

@Zapier’s AutomationBench drops AI models into realistic business environments across six domains (Sales, Marketing, Ops, Support, Finance, HR) and checks whether the work actually got done. The tasks include live CRM data, inbox threads with ambiguous context, and multi-step tool chains where one wrong call cascades. Scoring is deterministic: either the right records were updated and the right messages were sent, or they weren't.

It’s useful enough that we're releasing it publicly today. Open task set, open methodology, open leaderboard. Everyone should have access to this.

No model has cracked 10%. Yet. Try it here: zapier.com/benchmarks
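The deterministic scoring described in the tweet above (the right records updated and the right messages sent, or not) can be pictured as an all-or-nothing check on the environment's final state. The sketch below is purely illustrative and is not AutomationBench's actual grader: the `grade` function, the record keys, and the message tuples are all hypothetical.

```python
def grade(expected_records, expected_messages, final_records, sent_messages):
    """Return 1 if the final state matches the expectation exactly, else 0.

    Hypothetical pass/fail rule: every expected record must hold the expected
    value, and the set of sent messages must match the expected set exactly
    (so a missing or extra message fails the task).
    """
    records_ok = all(
        final_records.get(key) == value for key, value in expected_records.items()
    )
    messages_ok = sorted(expected_messages) == sorted(sent_messages)
    return 1 if records_ok and messages_ok else 0


# Hypothetical task: assign one CRM lead and send one follow-up email.
expected_records = {"crm:lead-42": {"owner": "dana", "status": "contacted"}}
expected_messages = [("lead42@example.com", "Follow-up on your demo request")]

# Run A: correct record update and exactly the expected message.
passed = grade(
    expected_records,
    expected_messages,
    final_records={"crm:lead-42": {"owner": "dana", "status": "contacted"}},
    sent_messages=[("lead42@example.com", "Follow-up on your demo request")],
)

# Run B: record is right, but the follow-up went to the wrong recipient.
failed = grade(
    expected_records,
    expected_messages,
    final_records={"crm:lead-42": {"owner": "dana", "status": "contacted"}},
    sent_messages=[("someone-else@example.com", "Follow-up on your demo request")],
)

print(passed, failed)
```

Because the check compares state rather than judging the transcript, the same run always gets the same score, which is the property the tweet emphasizes.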

Daniel Shepard retweeted

will brown

@willccbb

Apr 21

been a lot of fun seeing this environment come together. @zapier is cooking

Prime Intellect

@PrimeIntellect

Apr 21

We're excited to support the release of @Zapier's AutomationBench on the Environments Hub, measuring frontier performance on real Zapier workflows. Across 6 domains, 47 tools, and 600 tasks, frontier models all score under 10%.

Daniel Shepard retweeted

Prime Intellect

@PrimeIntellect

Apr 21

Wade Foster

@wadefoster

Apr 20

Daniel Shepard retweeted

Mike Knoop

@mikeknoop

Apr 20

I gave input on this new benchmark from Zapier. It's based on real automation patterns from ~2B tasks across ~4M Zapier customers and challenges both multi-tool orchestration and tool *construction*. SOTA is 10% at $2/task.

Wade Foster

@wadefoster

Apr 20

Daniel Shepard @danielwshepard

Apr 20

PrimeIntellect Lab is a major unlock for our RL workflows. We spun up experiments with very little setup to pressure-test our new AutomationBench benchmark, which surfaced reward-hacking opportunities we could then fortify against. Kicking off a training run was a single command.
