System cards, risk reports, and misc safety takes at Anthropic; math; puzzles; spaced repetition. Writes with too many caveats for Twitter.

Joined April 2014
I'm excited to explore more of this kind of validation in the future! Feels spiritually similar to an old Paul Christiano blog post that I've long wished could be closer to reality for AI companies: sideways-view.com/2018/02/01…
This part of the 4.7 Opus system card is pretty neat and seems potentially worth emulating (Anthropic showed Mythos the private discussions/evidence underlying the system card and asked Mythos if the Opus system card accurately characterized that private evidence)
In general I think this application of LLMs, as a way to verify some claim that depends on hard-to-share information by applying a judge of broadly well-understood character and discernment, is underutilized.
Much easier to trust "you didn't just blatantly lie about the result of this query to an LLM" than "your subjective judgement of this messy question, whose answer you have pressures to think should resolve in your favor, comes out to what a neutral party would have said."
In light of recent discourse around AI company employees' willingness to express dissent with their employer, I wanted to signal-boost this reply about a particular dynamic I experience fairly often. I suspect stories like this one are generally underrated in folks' models.
(epistemic status: not trying to make sweeping claims about the nature of epistemic pressures within the company or saying this is the whole story, just anecdata about a failure mode I've personally experienced that I think is rarely discussed in these kinds of situations) One kind of awkward dynamic I've noticed in myself wrt public negative commentary on Anthropic comms is that in some ways it's hard because of high _internal_ receptiveness to disagreement? Like, suppose that some senior person at Anthropic says a thing publicly and I think they had some bad takes or misleading narratives or whatever. If I oriented to them as a shadowy nefarious Authority who might punish me for speaking up, then whatever, screw them, that's an easy integrity call. But if they're just like some earnest nice person who I think is mistaken or wasn't particularly thoughtful in thinking about some consideration or whatever, then it feels super rude to go disparaging them on twitter when I haven't even brought it up to them directly and I know they would thoughtfully listen to the feedback and wouldn't dream of acting in a retaliatory way. But then, like, aw geez this senior person is extremely busy and this was kind of a minor point and I'd like to reword my totally unfiltered thoughts to be a little more tactful and charitable and avoid coming across like I think this was some kind of malicious act on their part when I really don't, and if I do send that DM or post in that thread then I should budget some time to reply to follow up conversations as a result, and it's probably worth checking with someone else internally to sanity check I'm not missing something silly that would make this a total waste of their time, and so the prospect of doing THAT is kind of friction-y and easy to put off when there's lots of other important work to do, and and and...
So as a result it's not like I actually feel like My Opinions Are Suppressed but it does end up being the case that a smaller fraction of random negative takes I have make it out into the world because it feels like a lot of work to do well and it's easier to procrastinate them. I don't really know what to do about this kind of situation, other than to be the sort of person who puts really quite a lot of time into talking over this sort of thing with people anyway.
I want to be an equal-opportunity contrarian, so a bit of pushback on this pro-RSP v3 thread: I agree 2026 looks worse than 2023 for the chance of voluntary commitments converting into strong effective regulation, and this is one reason I'm less excited about RSPv1.0 than I was.
Voluntary commitments to AI slowdowns were a nice idea in 2024 when it was plausible that they could be baby steps toward a multilateral agreement that would contain the intelligence explosion. For a variety of reasons this is no longer plausible. Anthropic is doing good here.
But I think there's still a case to be made here on more virtue-ethical grounds: it's much less morally risky to say, "we will never act so as to impose material catastrophic risk upon the world". You give up a lot of positive influence, but you have much less downside risk!
And I don't agree that racing looks quite so positive here (indeed the v3 RSP still has multiple measures to mitigate race dynamics); for one thing, if strong weight security is very hard then just building powerful models can empower evildoers too. x.com/davidad/status/2026721…

Replying to @davidad
We are in a period of rapidly intensifying risk from AI-empowered evildoers, which can only be resolved by a coalition of aligned superintelligences. On current margins, the increased risk exposure of a slowdown outweighs the risk reduction created by more alignment progress.
Anthropic's RSP v3 is out! TLDR: unilateral commitments to specific mitigations for predefined capability thresholds are mostly out, in favor of commitments to much more detailed transparency around both safety roadmaps and risk reports. Also new threat models, new commitments around competitor progress and external review, a vision for industry-wide safety, increased attention on the risks of internal deployment - there's a ton of new stuff. I'm pretty excited about this change, think it's a big improvement on v2.2, and also do not really think you can fit a good overall take on the update into 280 chars. Assorted thoughts:
We're updating our Responsible Scaling Policy to its third version. Since it came into effect in 2023, we’ve learned a lot about the RSP’s benefits and its shortcomings. This update improves the policy, reinforcing what worked and committing us to even greater transparency.
Please actually read and criticize it! Gripe about the ambiguity of the roadmaps! Run experiments to cast doubt on risk report methodology! I can name three significant complaints I have with the RSP off the top of my head and I expect to see none of them on X, prove me wrong!
(5) I highly highly recommend reading Holden's post on the motivation and reasoning behind the changes here; I expect half my twitter replies for the next week to be linking people to this post. It's really extremely good you guys. lesswrong.com/posts/HzKuzrKf…
Why is morning sickness a thing, evolutionarily? I can understand why bodily quirks like hiccups would take a while for natural selection to clear up, but "lose a large fraction of caloric intake during early gestation" seems really quite costly to fitness!
The LLMs talk a bit about avoiding risky food groups or suppressed immune systems or something, but I don't really buy it - a threshold that let you eat at all in the ancestral environment should basically never trigger with first world levels of variety and foodborne illness.
I think all of "pretraining", "preheat", and "precommit"* are most commonly used in a context where omitting the "pre" would be clearer. What's up with that? Are there more examples? *in the decisionmaking sense, not the git sense
TBC, I think I basically understand the story for why each of these specific examples came into common usage. But having explained each one, I'm still like, huh, what's going on that this particular kind of linguistic failure mode is such an attractor?
I think colloquial English "or" actually basically never means XOR, and isn't well-modeled as a logical operator of any sort - the datatype of a response to "do you want A or B" isn't a boolean, it's to say "A" or to say "B".
oh i did just think of a novel conjunction, that i've heard in normal spoken English: "and/or". it fills a lexical gap, because to the chagrin of Boolean logicians, "or" doesn't mean OR, it means XOR. so if you actually *do* want logical OR, you use "and/or" instead. if the waiter asks if you want "soup or salad", you do not get to have both. if your girl asks you, "who do you love? me or her?" she is not suddenly inviting you into polyamory. otoh if she asks, "me and/or her?" your life has suddenly become much easier and/or more complicated. (it'd be kinda neat if we decided to start spelling "and/or" as "andor" though, we really don't need the slash. although maybe "andore" would be a better spelling to match the pronunciation)
It's more like "I am enforcing/telling you that XOR(A,B); please choose a valid truth assignment."