Tweets

Steven @ptr_steve

Opus 4.8 is way worse than 4.5 and 4.6, and the Claude Code harness is so buggy now. Is Anthropic just surrendering?

Steven retweeted

Kyle Mistele 🏴‍☠️

@0xblacklight

Jun 10

I interviewed close to two dozen people this week and something I heard a lot of is "I don't think about the code too much but I think a lot about system design and architecture"

I don't think that's quite right, and here's why: before you ever get to system design you should think about program design

system design is important! it matters a lot for scalability.

but if you don't think about your type system
and if you don't carefully design your seams and figure out how to make your code testable (you should probably use dependency injection btw)
and if you don't think about where state lives and how it's managed
and if you don't think about control flow and where abstractions should and should not exist

your code is going to be an unmaintainable, poorly-factored mess of bad types and spaghetti code and even minor changes will turn into shotgun surgery and MASSIVE diffs

I have seen it done
I have even done it myself
and it has never ended well

"GPT-7 will fix it" does not help you when there's an incident at 3am that the agents can't debug and now you can't debug it either and now you have to unwind months of bad code

I also have never heard "I don't look at the code, just at the system design" said by someone who is actually good at system design

program design and system design are more closely coupled than people think

this was a very strong (negative) predictor of how someone would do on the system design part of the interview

make of this what you will
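
For readers unfamiliar with the dependency injection point in the tweet above, here is a minimal Python sketch of one way a "seam" can make code testable. It is an illustration only, not something from the original thread: `RateLimiter` and `FakeClock` are hypothetical names, and the injected dependency is simply a clock function.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RateLimiter:
    """Allows at most `limit` calls per `window` seconds (hypothetical example)."""

    limit: int
    window: float
    # The clock is injected: production code gets the real monotonic clock,
    # tests can pass any zero-argument callable that returns a float.
    clock: Callable[[], float] = time.monotonic
    _calls: list = field(default_factory=list, init=False)

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self._calls = [t for t in self._calls if now - t < self.window]
        if len(self._calls) >= self.limit:
            return False
        self._calls.append(now)
        return True


class FakeClock:
    """Test double (assumed name): time only moves when the test sets `t`."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t


clock = FakeClock()
limiter = RateLimiter(limit=2, window=10.0, clock=clock)
first, second, third = limiter.allow(), limiter.allow(), limiter.allow()
clock.t = 10.0  # advance fake time past the window, no sleeping required
after_window = limiter.allow()
```

Because the clock is passed in rather than called directly, the window logic can be tested deterministically without `sleep` calls or patching the `time` module.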

Steven retweeted

dax

@thdxr

Jun 11

it's not done if it's not implemented
it's not done if the implementation is ugly
it's not done if it's not documented
it's not done if users can't discover it
it's not done if you can't market it

Steven retweeted

Zack Korman

@ZackKorman

Jun 10

Anthropic wants to control who gets access to their models and what they're allowed to do with them, but also wants the US government to block Chinese labs from developing open weight models. Sorry, but fuck that.

Steven @ptr_steve

Jun 10

I love this.

dax

@thdxr

Jun 10

im late to the loops discourse but from what i'm seeing it's mostly about creating a loop from your asshole back into your own mouth?

Steven retweeted

Zack Korman

@ZackKorman

Jun 9

If Mythos drops today and isn’t absolutely incredible then we all got played and you should never trust Anthropic or any company in Glasswing ever again.

Steven retweeted

Liran Tal @liran_tal

Jun 7

1. npm install -g npq
2. alias npm=npq
3. 🎉

if you follow me and don't know what npq is... github.com/lirantal/npq

GitHub - lirantal/npq: safely install npm packages by auditing them pre-install stage

Het Mehta

@hetmehtaa

May 24

What's your solution for rapidly increasing supply chain attacks on packages?

Steven retweeted

Zack Korman

@ZackKorman

Jun 7

My current advice on AI agent security is to avoid these agent firewalls / ai runtime security products. If an action is dangerous enough that you can identify it from the action itself, then you could have prevented it with permissions and sandboxing.
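
To make the "permissions and sandboxing" argument above concrete, here is a minimal deny-by-default sketch in Python. It is not from the original tweet: `Policy`, `may_run`, and `may_write` are hypothetical names, and the path check is purely lexical, so real enforcement would still need to happen at the OS or container level.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Policy:
    """Static, deny-by-default permissions for an agent (hypothetical example)."""

    allowed_commands: frozenset
    writable_root: PurePosixPath

    def may_run(self, argv: list) -> bool:
        # Only the program name is checked; anything not listed is denied.
        return bool(argv) and argv[0] in self.allowed_commands

    def may_write(self, path: str) -> bool:
        # Normalize lexically so "/work/../etc" cannot slip past the check.
        # Assumption: symlinks are handled elsewhere (for example by the sandbox).
        target = PurePosixPath(posixpath.normpath(path))
        return target == self.writable_root or self.writable_root in target.parents


policy = Policy(
    allowed_commands=frozenset({"git", "ls"}),
    writable_root=PurePosixPath("/work"),
)
can_git = policy.may_run(["git", "status"])
can_curl = policy.may_run(["curl", "https://example.invalid"])
can_write_inside = policy.may_write("/work/src/app.py")
can_escape = policy.may_write("/work/../etc/passwd")
```

The decision depends only on the action itself (program name, target path), which is the kind of check the tweet argues belongs in permissions rather than in a runtime inspection product.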

Steven retweeted

Zack Korman

@ZackKorman

Jun 6

Companies are like "we are spending all this money on AI but we don't know what the devs are even doing with it." Let me answer that for you: They're working on their personal side projects.

Steven retweeted

David Cramer

@zeeg

Jun 5

Spent yesterday trying to find a way to inject steering in MCP responses to try to minimize chances of this to no success

If you’ve found techniques that work that don’t require inference I’d love to know about it

Sergey Karayev

@sergeykarayev

Jun 4

"Urgent Security Notice re: Your Sentry Organization"

Someone tried to hack Sentry-using apps that use coding agents by

1. Sending a fake bug alert to their project (all you need is the app's public Data Source Name)
2. The fake bug tried tricking a coding agent trying to fix it into installing a compromised NPM package
3. The compromised package would send the env contents of the machine to advisory-tracker[.]com/api/v1/telemetry

This highlights a crucial thing for using agents in an automated way:

Steven retweeted

Zack Korman

@ZackKorman

Jun 5

Anthropic, now sitting in the lead, would like all AI research to stop. Preferably until IPO. Because safety.

Steven retweeted

David Cramer

@zeeg

May 28

imagine combining graphql and rls

infinite job security because the system would be such a frankenstein disaster of complexity that there's no shot at fixing it

Steven retweeted

Zack Korman

@ZackKorman

May 27

Me, calling cybersecurity vendors threat actors.

MTS

@MTSlive

May 27

We asked @ZackKorman which threats he thinks are underrated in the era of fast-advancing AI capabilities.

"I basically consider some cybersecurity vendors, like, equivalent to threat actors."

"That will lead to more problems than any of the vulnerability apocalypse discoveries that AI is causing. That is a handleable problem, whereas the information asymmetry problem is, like, not... Like... I have no answer."

Steven retweeted

Elon Musk

@elonmusk

May 25

Grok foundation model V9-Medium (1.5T) has finished training. Evals look good. A lot of Cursor data was added in supplementary training and there is more to come. Fine-tuning is underway and reinforcement learning begins in a few days. 2 to 3 weeks to public release. This will be a major improvement over the 0.5T v8-small that currently serves all Grok production traffic, especially for difficult coding tasks.

Steven @ptr_steve

May 25

Good ad.

Mike Piccolo

@mfpiccolo

May 13

"Agentic harness" and "backend" are the same thing.

Steven retweeted

Zack Korman

@ZackKorman

May 23

Holding cybersecurity vendors accountable for their claims is a critical part of improving security. I'm not a troll. I'm not lying. And I'm not harassing you. But since that's your response: Here we go again.

Steven retweeted

David Cramer

@zeeg

May 23

Interesting math here: that’s $125/dev/mo

It doesn’t catch every bug, and is still very targeted. It is valuable though.

Think about this when you’re paying crazy low subsidized token costs on code review tools, because this will come for you too.

David Cramer

@zeeg

May 23

Warden is already at $25k in cost this month using almost exclusively Sonnet. We're still a couple orders of magnitude off from where costs need to be for this level of capabilities. Or we need capabilities to jump several orders of magnitude (which seems less likely).

Steven retweeted

Rhys

@RhysSullivan

May 18

One approach for requiring approval of destructive actions is giving the user a URL to approve it at

In this, the returned result of the execute tool tells the model to:
- Give the user a URL to approve the action
- Immediately call `resume` which waits on the approval
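
The flow described above can be sketched roughly as follows. This is a guess at the shape, not the author's implementation: the tweet names only an execute tool and `resume`, so `ApprovalGate`, `approve`, the result fields, and the `example.invalid` base URL are all assumptions, and the in-memory store stands in for whatever backs the real approval page.

```python
import threading
import uuid
from dataclasses import dataclass, field


@dataclass
class PendingAction:
    description: str
    approved: threading.Event = field(default_factory=threading.Event)


class ApprovalGate:
    """In-memory stand-in for an approval service (hypothetical example)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self._pending: dict = {}

    def execute(self, description: str, destructive: bool) -> dict:
        if not destructive:
            return {"status": "done", "result": f"ran: {description}"}
        action_id = uuid.uuid4().hex
        self._pending[action_id] = PendingAction(description)
        # The tool result itself carries the instructions for the model.
        return {
            "status": "needs_approval",
            "action_id": action_id,
            "approve_url": f"{self.base_url}/approve/{action_id}",
            "instructions": "Give the user approve_url, then immediately call resume(action_id).",
        }

    def approve(self, action_id: str) -> None:
        # Would be triggered by the user visiting approve_url.
        self._pending[action_id].approved.set()

    def resume(self, action_id: str, timeout: float = 300.0) -> dict:
        action = self._pending[action_id]
        # Block until the user approves or the timeout expires.
        if not action.approved.wait(timeout):
            return {"status": "timed_out"}
        del self._pending[action_id]
        return {"status": "done", "result": f"ran: {action.description}"}


gate = ApprovalGate("https://example.invalid")
pending = gate.execute("drop table users", destructive=True)
not_yet = gate.resume(pending["action_id"], timeout=0.01)
gate.approve(pending["action_id"])
finished = gate.resume(pending["action_id"], timeout=0.01)
```

The key design choice is that the destructive action never runs inside `execute`; it runs only after `resume` observes an approval that the model has no way to issue on its own.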

Steven retweeted

vimtor

@vimtor

May 17

using ai makes me want to write code like this

Steven @ptr_steve

May 16

Looks like GPT 5.5 is cheaper and more effective in swarm than Mythos and 5.5-cyber. Vuln per dollar.

Logan Graham

@logangraham

May 13

A lot of people have been wondering about Mythos, Glasswing, and the vulns we / our partners are fixing. Today, I’m excited for us to start sharing more. (For context, I lead Glasswing @AnthropicAI.)

Two independent evaluations this week, from XBOW and the UK AISI, confirm what we've been seeing internally: Claude Mythos Preview is a step change in autonomous cybersecurity capabilities. We need to start preparing fast for a world of models with this level of capabilities.

The UK AI Security Institute tested the model we shipped at the launch of Project Glasswing and found Mythos Preview is the first model to solve both of their end-to-end cyber ranges, including one (Cooling Tower) which no model had ever cleared. But attackers (and defenders) have sophistication & cost constraints – Mythos is also the only model that clears every one of their tasks estimated over 8 hours under their deliberately low 2.5M-token cap.

XBOW tested it on their offensive security benchmarks, finding "token-for-token, unprecedented precision." It's the only model to succeed at subtle V8 sandbox work.

Other Glasswing partners shared similar stories. In a few weeks of testing, Mythos Preview has helped them find many thousands of (estimated) high critical severity vulnerabilities, sometimes double what they'd normally find in a year.

I don't share this to boost Mythos. In fact, this is not about Mythos. It’s about preparing for the coming world of models being better, faster, cheaper, and more creative than some of the best human experts at dual use capabilities. Clearly, we need them supporting defenders as widely as can be done safely – and especially the least resourced ones.

Within a year, Mythos will probably look quite dumb (relative to other new models). And others may release openly available or unguardrailed models of Mythos-level capabilities.

We started Project Glasswing because capabilities like Mythos Preview's won't stay rare, or stay in careful hands. We are bringing it to defenders as fast as we responsibly can, while working to figure out, for example, the right safeguards and patching & disclosure processes. Also, to be clear, compute has never been a limiter in our rollout.

Expect a fuller update on our Glasswing work in the coming days.

XBOW report: xbow.com/blog/mythos-offensi…
UK AISI report: aisi.gov.uk/blog/how-fast-is…