Drake Thomas

Tweets

Drake Thomas @MaskedTorah

Apr 16

I'm excited to explore more of this kind of validation in the future! Feels spiritually similar to an old Paul Christiano blog post that I've long wished could be closer to reality for AI companies: sideways-view.com/2018/02/01…

Honest organizations

Suppose that I’m setting up an AI project, call it NiceAI. I may want to assure my competitors that I abide by strong safety norms, or make other commitments about NiceAI’s behavior. Th…

sideways-view.com

Nathan Calvin

@_NathanCalvin

Apr 16

This part of the 4.7 Opus system card is pretty neat and seems potentially worth emulating (Anthropic showed Mythos the private discussions/evidence underlying the system card and asked Mythos if the Opus system card accurately characterized that private evidence)

Drake Thomas @MaskedTorah

Apr 16

In general I think this application of LLMs, as a way to verify some claim that depends on hard-to-share information by applying a judge of broadly well-understood character and discernment, is underutilized.

Drake Thomas @MaskedTorah

Apr 16

Much easier to trust "you didn't just blatantly lie about the result of this query to an LLM" than "your subjective judgement of this messy question, whose answer you have pressures to think should resolve in your favor, comes out to what a neutral party would have said."

Drake Thomas @MaskedTorah

Mar 2

In light of recent discourse around AI company employees' willingness to express dissent with their employer, I wanted to signal-boost this reply about a particular dynamic I experience fairly often. I suspect stories like this one are generally underrated in folks' models.

Drake Thomas @MaskedTorah

Mar 2

Replying to @bshlgrs @1a3orn @binarybits

(epistemic status: not trying to make sweeping claims about the nature of epistemic pressures within the company or saying this is the whole story, just anecdata about a failure mode I've personally experienced that I think is rarely discussed in these kinds of situations)

One kind of awkward dynamic I've noticed in myself wrt public negative commentary on Anthropic comms is that in some ways it's hard because of high _internal_ reception for disagreement? Like, suppose that some senior person at Anthropic says a thing publicly and I think they had some bad takes or misleading narratives or whatever. If I oriented to them as a shadowy nefarious Authority who might punish me for speaking up, then whatever, screw them, that's an easy integrity call.

But if they're just like some earnest nice person who I think is mistaken or wasn't particularly thoughtful in thinking about some consideration or whatever, then it feels super rude to go disparaging them on twitter when I haven't even brought it up to them directly and I know they would thoughtfully listen to the feedback and wouldn't dream of acting in a retaliatory way.

But then, like, aw geez this senior person is extremely busy and this was kind of a minor point and I'd like to reword my totally unfiltered thoughts to be a little more tactful and charitable and avoid coming across like I think this was some kind of malicious act on their part when I really don't, and if I do send that DM or post in that thread then I should budget some time to reply to follow up conversations as a result, and it's probably worth checking with someone else internally to sanity check I'm not missing something silly that would make this a total waste of their time, and so the prospect of doing THAT is kind of friction-y and easy to put off when there's lots of other important work to do, and and and...

So as a result it's not like I actually feel like My Opinions Are Suppressed but it does end up being the case that a smaller fraction of random negative takes I have make it out into the world because it feels like a lot of work to do well and it's easier to procrastinate them. I don't really know what to do about this kind of situation, other than to be the sort of person who puts really quite a lot of time into talking over this sort of thing with people anyway.

Drake Thomas @MaskedTorah

Feb 25

I want to be an equal-opportunity contrarian, so a bit of pushback on this pro-RSP v3 thread: I agree 2026 looks worse than 2023 for the chance of voluntary commitments converting into strong effective regulation, and this is one reason I'm less excited about RSPv1.0 than I was.

davidad 🎇

@davidad

Feb 25

Voluntary commitments to AI slowdowns were a nice idea in 2024 when it was plausible that they could be baby steps toward a multilateral agreement that would contain the intelligence explosion. For a variety of reasons this is no longer plausible. Anthropic is doing good here.

Drake Thomas @MaskedTorah

Feb 25

But I think there's still a case to be made here on more virtue-ethical grounds: it's much less morally risky to say, "we will never act so as to impose material catastrophic risk upon the world". You give up a lot of positive influence, but you have much less downside risk!

Drake Thomas @MaskedTorah

Feb 25

And I don't agree that racing looks quite so positive here (indeed the v3 RSP still has multiple measures to mitigate race dynamics); for one thing, if strong weight security is very hard then just building powerful models can empower evildoers too. x.com/davidad/status/2026721…

davidad 🎇

@davidad

Feb 25

Replying to @davidad

We are in a period of rapidly intensifying risk from AI-empowered evildoers, which can only be resolved by a coalition of aligned superintelligences. On current margins, the increased risk exposure of a slowdown outweighs the risk reduction created by more alignment progress.

Drake Thomas @MaskedTorah

Feb 25

Anthropic's RSP v3 is out! TLDR: unilateral commitments to specific mitigations for predefined capability thresholds are mostly out, in favor of commitments to much more detailed transparency around both safety roadmaps and risk reports. Also new threat models, new commitments around competitor progress and external review, a vision for industry-wide safety, increased attention on the risks of internal deployment - there's a ton of new stuff. I'm pretty excited about this change, think it's a big improvement on v2.2, and also do not really think you can fit a good overall take on the update into 280 chars. Assorted thoughts:

Anthropic

@AnthropicAI

Feb 24

We're updating our Responsible Scaling Policy to its third version. Since it came into effect in 2023, we’ve learned a lot about the RSP’s benefits and its shortcomings. This update improves the policy, reinforcing what worked and committing us to even greater transparency.

Drake Thomas @MaskedTorah

Feb 25

Please actually read and criticize it! Gripe about the ambiguity of the roadmaps! Run experiments to cast doubt on risk report methodology! I can name three significant complaints I have with the RSP off the top of my head and I expect to see none of them on X, prove me wrong!

Drake Thomas @MaskedTorah

Feb 25

(5) I highly highly recommend reading Holden's post on the motivation and reasoning behind the changes here; I expect half my twitter replies for the next week to be linking people to this post. It's really extremely good you guys. lesswrong.com/posts/HzKuzrKf…

Responsible Scaling Policy v3 — LessWrong

All views are my own, not Anthropic’s. This post assumes Anthropic’s announcement of RSP v3.0 as background. …

lesswrong.com

Drake Thomas @MaskedTorah

Feb 1

Why is morning sickness a thing, evolutionarily? I can understand why bodily quirks like hiccups would take a while for natural selection to clear up, but "lose a large fraction of caloric intake during early gestation" seems really quite costly to fitness!

Drake Thomas @MaskedTorah

Feb 1

The LLMs talk a bit about avoiding risky food groups or suppressed immune systems or something, but I don't really buy it - a threshold that let you eat at all in the ancestral environment should basically never trigger with first world levels of variety and foodborne illness.

Drake Thomas @MaskedTorah

Feb 1

I think all of "pretraining", "preheat", and "precommit"* are most commonly used in a context where omitting the "pre" would be clearer. What's up with that? Are there more examples? *in the decisionmaking sense, not the git sense

Drake Thomas @MaskedTorah

Feb 1

TBC, I think I basically understand the story for why each of these specific examples came into common usage. But having explained each one, I'm still like, huh, what's going on that this particular kind of linguistic failure mode is such an attractor?

Drake Thomas @MaskedTorah

2 Dec 2025

I think colloquial English "or" actually basically never means XOR, and isn't well-modeled as a logical operator of any sort - the datatype of a response to "do you want A or B" isn't a boolean, it's to say "A" or to say "B".

iwsfutcmd will see you at VC5

@iwsfutcmd

2 Dec 2025

oh i did just think of a novel conjunction, that i've heard in normal spoken English: "and/or" it fills a lexical gap, because to the chagrin of Boolean logicians, "or" doesn't mean OR, it means XOR so if you actually *do* want logical OR, you use "and/or" instead if the waiter asks if you want "soup or salad", you do not get to have both. if your girl asks you, "who do you love? me or her?" she is not suddenly inviting you into polyamory otoh if she asks, "me and/or her?" your life has suddenly become much easier and/or more complicated (it'd be kinda neat if we decided to start spelling "and/or" as "andor" though, we really don't need the slash. although maybe "andore" would be a better spelling to match the pronunciation)

Drake Thomas @MaskedTorah

2 Dec 2025

It's more like "I am enforcing/telling you that XOR(A,B); please choose a valid truth assignment."
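
One way to make that distinction concrete is a small Python sketch. Every name in it (`Choice`, `xor_assignments`, `to_assignment`) is a hypothetical illustration rather than anything from the thread, and the soup/salad options are borrowed from the quoted example. It contrasts reading "A or B?" as a boolean expression with reading it as a request to pick one option under an XOR constraint.

```python
from enum import Enum
from itertools import product


class Choice(Enum):
    """Hypothetical response type for "do you want soup or salad?"."""
    SOUP = "soup"
    SALAD = "salad"


def xor_assignments():
    """List the (soup, salad) truth assignments that satisfy XOR(soup, salad)."""
    return [(a, b) for a, b in product([False, True], repeat=2) if a != b]


def to_assignment(choice: Choice) -> tuple[bool, bool]:
    """Map a chosen option to the truth assignment it picks out."""
    return (choice is Choice.SOUP, choice is Choice.SALAD)


# Read as a boolean question, "soup or salad?" would be answered True or False.
# Read as a choice, the answer is one of the options, and each possible answer
# corresponds to an assignment allowed by the XOR constraint.
valid = xor_assignments()
answers = {choice: to_assignment(choice) for choice in Choice}
all_valid = all(assignment in valid for assignment in answers.values())
```

Under these assumptions the response is a `Choice`, not a `bool`; the XOR only describes which combinations the asker is willing to accept.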
