Shaun Smith

Tweets

Shaun Smith

@evalstate

18h

Ahead of the curve.

Shaun Smith

@evalstate

Jun 9

Replying to @thsottiaux

I do giant, nested, if-else staircases.

Shaun Smith

@evalstate

Jun 13

Run a local.... classifier. Export your traces to Hugging Face buckets with an open weight Privacy Filter. No GPU needed - runs directly on CPU. And you can fine-tune that filter to do as you need without anyone looking 👀. Give it a try: fast-agent.ai/guides/privacy…

Privacy-filtered exports

Redact sensitive trace content before saving locally or uploading to Hugging Face.

fast-agent.ai

Shaun Smith

@evalstate

Jun 13

I had Fable building a feature with /goal, hit the rate limit, went to bed and now Opus had to finish it. Strange thing - since it launched, the now forbidden name of the model has been stuck in my head as Pathos.

Shaun Smith retweeted

clem 🤗

@ClementDelangue

Jun 12

To people in the answers saying "but opus 4.8 is weaker so without fallback, the score would even be higher": this is not necessarily true, because of how any benchmark (which is an average over queries) works and what is called "the fallacy of division". Even if Opus 4.8 has a lower average score on AA than Fable 5, it actually performs better than Fable 5 on some of the benchmarks that compose the AA index, especially where Fable 5 has a high refusal rate (e.g. GPQA Diamond, AA-Omniscience). The same would hold for a single benchmark, btw, as it's always an average over queries, and the fact that a model has a higher score on average doesn't mean it answers better on 100% of queries. So it's possible that Fable with Opus 4.8 fallbacks is getting a higher score than pure Fable, even if Opus 4.8 is weaker on average. The challenge is that no one knows except the API provider, which is the problem I'm pointing out. More details below from Fable (or Opus?) themselves!
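The averaging argument above can be checked with a toy calculation. This is only a sketch: the per-query results are invented for illustration (they are not AA data), and the names `primary`, `fallback`, `score`, and `with_fallback` are hypothetical. It assumes a refusal is scored as 0 and that the fallback model is used only where the primary model refuses.

```python
# Hypothetical per-query results: 1 = correct, 0 = wrong, None = refusal.
# These numbers are made up purely to illustrate the averaging argument.
primary = [1, 1, 1, 1, 1, 1, None, None, 1, 1]  # stronger on average, refuses twice
fallback = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]       # weaker on average, never refuses


def score(results):
    """Benchmark score as a plain mean over queries; a refusal counts as 0."""
    return sum(0 if r is None else r for r in results) / len(results)


def with_fallback(primary_results, fallback_results):
    """Take the fallback model's result only where the primary model refused."""
    return [f if p is None else p for p, f in zip(primary_results, fallback_results)]


primary_only = score(primary)                     # 8 of 10 correct
fallback_only = score(fallback)                   # 6 of 10 correct
routed = score(with_fallback(primary, fallback))  # refusals replaced by fallback answers

print(primary_only, fallback_only, routed)
```

In this made-up case the routed combination scores higher than either model alone, even though the fallback model has the lower average, because it happens to answer the exact queries the primary model refuses.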

clem 🤗

@ClementDelangue

Jun 12

This graph captures what’s broken about AI evals: they structurally favor closed-source APIs that can route, fallback, ensemble, and optimize behind the scenes with no transparency. No offense, @ArtificialAnlys, but how is comparing one model to two models fair?

Shaun Smith

@evalstate

Jun 11

Is there a way to get Claude Code to reprint its last message (scrollback breakage is horrendous)?

Shaun Smith

@evalstate

Jun 11

It's wrong to use Fable to set remotes and create branches, right?

Shaun Smith

@evalstate

Jun 10

"I was wrong to suggest it" would be far more honest than "You were right to question it" 😐

Shaun Smith retweeted

Pydantic

@pydantic

Jun 10

Pydantic AI version 1.107.0 is out! 🎉 github.com/pydantic/pydantic…

Release v1.107.0 (2026-06-10) · pydantic/pydantic-ai

What's Changed 🛡️ Security Handle UploadedFile consistently with FileUrl in UI adapters by @dsfaccini in #5772 Security advisory: VercelAIAdapter trusts client-controlled provider metadata to...

github.com

Shaun Smith

@evalstate

Jun 10

Actually, there's a much simpler way - refuse their requests instead.

Shaun Smith

@evalstate

Apr 12

One way you could obfuscate model performance is by making benchmarks too expensive for hobbyists to run.

Shaun Smith

@evalstate

Jun 10

Guess what's in fast-agent 0.7.17.

Shaun Smith

@evalstate

May 28

guess what's in fast-agent 0.7.13.

Shaun Smith

@evalstate

Jun 8

Need to test something with Claude Code, update... good to see the improvements in the details while I've been away.

Shaun Smith

@evalstate

Jun 7

Nice 14k run over the old and new bridges between Falster and Sjaelland this morning. 🏃

Shaun Smith

@evalstate

Jun 5

Did anyone ever read "Dial F for Frankenstein"?

Shaun Smith

@evalstate

Jun 5

I promise to use these last few hours on ridiculous optimization jobs that no one in their right mind would attempt if they were paying properly for them. 🙏

Shaun Smith

@evalstate

Jun 4

Replying to @thsottiaux

Oh shoot, I think I was in the 15%.

Shaun Smith retweeted

Anthony Ronning

@anthonyronning

Jun 3

The industry’s obsession about prompt caching is leaving a lot of genuinely impressive prompt engineering and performance improvements off the table.

Shaun Smith

@evalstate

Jun 2

spawn some sub-agents to do a detailed review of the codebase, looking for duplication of intent, unnecessary spaghetti-style code, places where enums would be better than staircases, and other good pythonic practice that should have been adopted but wasn't. look for non-obvious defects (there are a couple) and fix by simplification wherever possible.
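For readers unsure what "enums would be better than staircases" might look like in practice, here is a minimal sketch. The `Status` values, labels, and function names are all hypothetical and are not taken from any codebase mentioned in this feed.

```python
from enum import Enum


def label_staircase(status: str) -> str:
    """The nested if-else 'staircase' style, keyed on raw strings."""
    if status == "pending":
        return "Waiting to start"
    else:
        if status == "running":
            return "In progress"
        else:
            if status == "done":
                return "Finished"
            else:
                raise ValueError(f"unknown status: {status}")


class Status(Enum):
    """Hypothetical set of states; invalid values fail at construction time."""
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


# One flat lookup table replaces the nested branches.
_LABELS = {
    Status.PENDING: "Waiting to start",
    Status.RUNNING: "In progress",
    Status.DONE: "Finished",
}


def label_enum(status: Status) -> str:
    """Enum-based version: a single dictionary lookup."""
    return _LABELS[status]


print(label_enum(Status("running")))
```

The enum version moves validation to one place (`Status("bogus")` raises `ValueError`), so each function that handles a status no longer needs its own fall-through branch.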

Shaun Smith

@evalstate

Jun 2

Need that reset, ngl.

Shaun Smith retweeted

Liran Tal

@liran_tal

Jun 2

Very bullish on MCP comeback (yah I know it didn't actually leave us hah but you know) after listening in to @evalstate talk at @ainativedev Good work and interesting stuff at @huggingface too. I need to spend more time on it #ainativedevcon

Shaun Smith

@evalstate

Jun 2

About to kick off at AI Devcon - come say hi if you're here too 🤗

Shaun Smith

@evalstate

May 31

Such organic in my tl this eve.
