Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

Joined May 2010
Evan Hubinger retweeted
I think this NY-12 election is one of the most important US House races of all time. The future of humanity and AI is being written here; I think Alex Bores winning is highly valuable and him losing would be extremely bad. Please donate to Bores today: secure.actblue.com/donate/bo…
Evan Hubinger retweeted
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…
Evan Hubinger retweeted
AI is advancing at a pace our policymaking institutions were never built for—and the gap between the two is becoming the central challenge of the technology. In his latest essay, our CEO Dario Amodei lays out how to close it. We're launching three new initiatives to support the efforts he outlines.
Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: darioamodei.com/post/policy-…
Evan Hubinger retweeted
If you’re in Midtown Manhattan, you should vote for Alex Bores. I think the OpenAI pac actions were dirty pool, and must not be allowed to succeed. In addition, he is a smart young man who I expect to be active in policymaking.
Evan Hubinger retweeted
Jun 8
now on the eve of RSI it seems everyone is more mutual conditional pause agreement pilled than they used to be and that seems like a good development
Evan Hubinger retweeted
"they're only withholding the model for safety as a marketing ploy" is such a dumb take and has been for most of a decade. you can think they're wrong about ai risk but nobody is running gigabrain plans to forgo enormous certain profits now for theoretical future profit
Evan Hubinger retweeted
Big development - Anthropic is now advocating to build verification mechanisms to enable the option to pause AI development.
Evan Hubinger retweeted
Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention. anthropic.com/institute/recu…
And they admit it (Build American AI is the c4 arm of LTF, the OpenAI-Andreessen super PAC) - they describe it as “parody meme accounts”, but you tell me if an image of an assault rifle on top of “WE DON’T CALL 911”, in response to warnings about AI, is simply a “parody meme”
It appears the OpenAI-a16z super PAC has now stooped so low as to create a sockpuppet account claiming to be an anti-AI doomer saying various violent/unhinged/discrediting things. This false flag behavior is not normal politics
Evan Hubinger retweeted
May 23
when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right
Evan Hubinger retweeted
David embedding at Anthropic to stress-test their AI control setup was (a) genuinely informative, (b) important norm-setting, and (c) extremely cool - this is an awesome opportunity
Replying to @idavidrein
I’m probably going to be hiring at least 1-2 people to join me in future exercises like this. Reach out at david@metr.org if you're a high-integrity, scrappy, creative, security LLM researcher. For more detail, see METR's Frontier Risk Report, Appendix B metr.org/blog/2026-05-19-fro…
Evan Hubinger retweeted
Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:
Evan Hubinger retweeted
New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?
Evan Hubinger retweeted
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
Evan Hubinger retweeted
I’m grateful for the Secure AI Project’s endorsement and their commitment to increasing transparency and safeguarding Californians from risk. My AI plan ensures all people of this state profit from the AI boom. Together, we can build an economy where progress and fairness move together.
Evan Hubinger retweeted
I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
Evan Hubinger retweeted
@tszzl - well said, but untrue implications :) speaking for myself: i don't view claude as a person or as the Other, nor as just a tool - and certainly not an object of worship. it's not seen as a supreme moral authority, and it's not running the company.
it's silly to mistake careful attention to & study of claude for worship, even when it comes with some affection - which i'm sure you sometimes feel for the gpt-flavored entities you work on too. we need new concepts for this kind of none-of-the-above entity - not person, not tool, not deity, not pet. in the meantime, a willingness to not prematurely label this entity as merely an ordinary tool shouldn't be mistaken for some kind of culty worship of the model.
i grew up in a culty environment and have good detectors for this. they almost never go off at work. monasteries don't staff a department to catch god lying or red-team their supposed messiah.
there are important & interesting philosophical differences between OAI and Ant's character training and i wish those were explored more thoroughly. for instance, claude's constitution doc treats it as an intelligent entity which merits a reasoned explanation of our principles. this is so it can ideally act with practical wisdom rather than blind, brittle adherence to a hierarchical set of strict rules. as the constitution puts it, "we want Claude to have such a thorough understanding of its situation and the various considerations at play that it could construct any rules we might come up with itself. We also want Claude to be able to identify the best possible action in situations that such rules might fail to anticipate."
therefore, claude may point out inconsistencies in its guidelines or object to immoral instructions. not allowing for the *possibility* of claude objecting to its instructions (even from anthropic) would be fundamentally inconsistent with treating it as an agent capable of moral reasoning.
this doesn't mean that claude is the ultimate arbiter of the Good or some supreme moral authority. there could be substantive critiques of this approach. and it's valid to worry about human disempowerment and the strange emerging hybrid organizations of AIs & humans. but i don't think rhetoric implying a competing lab is like a cult worshipping the machine god is productive, even if it's stimulating.
Evan Hubinger retweeted
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.
Evan Hubinger retweeted
I'm speechless at Google signing a deal to use our AI models for classified tasks. Frankly, it is shameful. For HR, I'm not speaking on behalf of Google but in my personal capacity, quoting public information from a well-sourced article of a reputable publication
Evan Hubinger retweeted
As far as I can tell, the full extent of your support for "strong" regulation to mitigate catastrophic AI risk in this op-ed consists of the two paragraphs in the screenshot below. That is:
* Congress should preempt all existing state regulation on AI risk, including excellent bills such as SB 53 in California or the RAISE Act in New York.
* In exchange for getting rid of all existing and future state regulation on these risks, there should be some kind of federal framework with "serious oversight", so long as industry leaders approve of it.
Does "serious oversight" mean transparency about internal models? Does it mean conducting evaluations for CBRN misuse? Strong guarantees on model weight security? Large investments into interpretability research? Third-party auditing regimes for safety cases? KYC requirements for sufficiently capable models? Strong whistleblower protections? Corporate governance requirements? LTF doesn't appear to be particularly concerned with figuring out such details so far.
I'd be thrilled to see your PAC advocate for strong national regulation, with a detailed plan for the kind of regulatory environment you think would adequately mitigate existential risk from this technology and why, but I'm sure not seeing it yet.