agents for work @evalopsdev

Joined November 2015
Pinned Tweet
Your agents shouldn’t be loose scripts with credit cards and tool access. They should run through a control plane.
Coming soon.
Jonathan Haas retweeted
Here are some principles you can infer from @satyanadella's paragraph:
- There will be a better model tomorrow.
- Prompts are great for building POCs, but terrible at specifying system behaviors.
- To switch models easily, you need good evals and a system for generating and holding a new prompt accountable for a given model.
- With such a system, you can almost certainly use a model magnitudes faster and cheaper than frontier models.
- Evals are THE asset for all enterprises.
- Evals should never stop growing. 🤔
Still blown away by @AmpCode. I spent ~5 hours debugging something across Codex, Claude, and hand-written tests (I know!) like it was 2019. Amp solved it in ~6 minutes with one oracle session and an approach I hadn’t considered. @sqs and team are cooking.
Jonathan Haas retweeted
imagine telling your customers there's a small chance you'll randomly decide they're using your product wrong and you won't tell them but will secretly silently sabotage their work
I uploaded Anthropic’s own published system card to Claude and asked “wdyt abt it?” Claude refused to read it because the system card contains safety-sensitive topics. We’ve reached AI safety so advanced it cannot inspect the AI safety document.
Time to complain about Claude speeds again? :)
Endless missing model spam on Codex? @OpenAIDevs
funniest possible outcome is the AI reads the S-1 announcement, summarizes it neutrally, and adds 'this is after my knowledge cutoff so I can't verify it' - which, for the record, is exactly what Claude did twenty minutes ago
Anthropic has confidentially submitted a draft S-1 registration statement to the Securities and Exchange Commission. Pending completion of SEC review, this gives us the option to pursue an initial public offering. Read more: anthropic.com/news/confident…
Opus 4.8 appears amazing if your workflow is “pay premium tokens to supervise a very articulate coin flip.”
Every team has a dashboard for latency. Most teams find out about token spend from finance. Cost is just an eval nobody wrote.
Your token spend was a number you could've gated on. Instead it's a number you get to explain.
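A minimal sketch of what "gating on token spend" could look like, treating cost as one more pass/fail check in an eval run. Everything here is hypothetical: the `RunRecord` shape, the `cost_gate` helper, and the per-1k-token prices are illustrative assumptions, not any vendor's API or real pricing.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Token usage for one eval case (hypothetical shape)."""
    case_id: str
    input_tokens: int
    output_tokens: int


def run_cost_usd(run, usd_per_1k_input, usd_per_1k_output):
    """Cost of a single run under assumed per-1k-token prices."""
    return (run.input_tokens / 1000) * usd_per_1k_input + (
        run.output_tokens / 1000
    ) * usd_per_1k_output


def cost_gate(runs, budget_usd, usd_per_1k_input, usd_per_1k_output):
    """Return (passed, total_usd): passed is True when spend fits the budget."""
    total = sum(
        run_cost_usd(r, usd_per_1k_input, usd_per_1k_output) for r in runs
    )
    return total <= budget_usd, total


# Example: two eval cases checked against a made-up budget and made-up prices.
runs = [RunRecord("a", 1000, 500), RunRecord("b", 2000, 1000)]
passed, total = cost_gate(
    runs, budget_usd=0.10, usd_per_1k_input=0.01, usd_per_1k_output=0.03
)
```

Wired into CI next to the latency and quality checks, a failing gate would block the change before finance sees the bill.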
Jonathan Haas retweeted
Self-driving cars are fun because you never see competing SaaS products having a literal standoff in the street
Jonathan Haas retweeted
can't believe i spent my whole life becoming Good At Computer only for Computer to become Better At Computer
my commit history this year is 60k commits and my contribution is 'told it to stop being so confident on Tuesdays'
Jonathan Haas retweeted
everyone's like "how big is your team" brother. it's one agent. it's opening PRs against itself. i haven't written code in four months. leave me alone
Things I will not be doing today:
– installing Playwright
– activating a venv
– pip-installing 41 transitive deps to click a button
Things I did instead: ported browser-use to Rust, pinned to a frozen upstream SHA, and exposed it over MCP a local JSON-RPC daemon. github.com/evalops/browser-u…
Zain and team are building something absolutely incredible. Check them out!!! 👇
After months in stealth, my co-founder @helloericsf and I are finally sharing @cimentoai with the world. 🌎 AI changed social engineering. Attacks are now personalized, convincing, and cheap to generate at scale.
Jonathan Haas retweeted
In the last two weeks: ServiceNow shipped Action Fabric, AWS MCP Server hit GA, Microsoft moved Agent 365 to GA. The agent execution layer is the new cloud. Ignore it now, pay for it in 2027.
Have been stewing on this for ages with @EvalOpsDev
Told you guys! It is all eval/rubrics now.