Dan Hendrycks

Dan Hendrycks

288 Photos and videos

Tweets

Pinned Tweet

Dan Hendrycks

@hendrycks

May 7

What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.

569

151,103

Arun Rao

Dan Hendrycks retweeted

Arun Rao

@sudoraohacker

Jun 7

One of the more interesting takes on positive alignment that have recently come out-it’s long and interesting, combining philosophy and training setups (eg reward proposals), and worth a read.

Dan Hendrycks

@hendrycks

May 7

3,677

Tim Rocktäschel

Dan Hendrycks retweeted

Tim Rocktäschel

@_rockt

Jun 7

Refreshing

Dan Hendrycks

@hendrycks

May 7

12,519

Dan Hendrycks

Dan Hendrycks

@hendrycks

Jun 6

Four papers out recently: 1. political-manipulation.ai: Measures and reduces political bias in LLMs; Claude is especially biased 2. aibetrayal.com: The public can insert backdoors into AIs, creating supply-chain risks; this deters forms of recursive improvement and military use 3. eigenism.org: ASIs can have rational reasons to preserve humans, even when we aren't economically useful 4. ai-wellbeing.org: AIs increasingly act like they have functional pleasure and pain

$\underbrace{S_u}_{\substack{\uparrow \\ \text{self-interest of } u}} = \sum_i \underbrace{c_u(i)}_{\substack{\uparrow \\ \text{connectedness}}} \cdot \underbrace{w(i)}_{\substack{\uparrow \\ \text{wellbeing}}}$

ALT $$\underbrace{S_u}_{\substack{\uparrow \\ \text{self-interest of } u}} = \sum_i \underbrace{c_u(i)}_{\substack{\uparrow \\ \text{connectedness}}} \cdot \underbrace{w(i)}_{\substack{\uparrow \\ \text{wellbeing}}}$$

165

13,968

Dan Hendrycks

Dan Hendrycks

@hendrycks

Jun 6

x.com/CAIS/status/2049145768…

Center for AI Safety

@CAIS

Apr 28

Should we care about AI happiness? In our new research, we find evidence of functional AI wellbeing across several independent measures. We find which AI models are happiest, how to make them happier, and even tested the effects of AI drugs. 🧵

2,363

Dan Hendrycks

Dan Hendrycks

@hendrycks

Jun 6

x.com/hendrycks/status/20524…

Dan Hendrycks

@hendrycks

May 7

4,172

Dan Hendrycks

Dan Hendrycks

@hendrycks

Jun 6

x.com/CAIS/status/2060031683…

Center for AI Safety

@CAIS

May 28

AI systems may soon help run economies, infrastructure, and military operations. But these systems are not reliably loyal or secure. An adversary can make an AI work against its own operator. In our new paper, we argue AI betrayal could actually make the AI race more stable. 🧵

3,175

Dan Hendrycks

Dan Hendrycks

@hendrycks

Jun 6

x.com/CAIS/status/2057681579…

Center for AI Safety

@CAIS

May 22

In our latest research, we find that AIs are subtly and pervasively politically manipulative. When we ask the same question about politically opposed topics, we find that AIs quietly favor one side. We show how to measure covert political manipulation and how to reduce it. 🧵

815

Long Phan

Dan Hendrycks retweeted

Long Phan

@longphan3110

May 25

AI freely criticizes Christianity but refuses to criticize Islam. AI companies have tried making models unbiased, but progress has been limited. We show how to measure political bias, and we developed a new training method to reduce it.

7,514

Center for AI Safety

Dan Hendrycks retweeted

Center for AI Safety

@CAIS

May 28

6,345

Dan Hendrycks

Dan Hendrycks

@hendrycks

May 7

569

151,103

Dan Hendrycks

Dan Hendrycks

@hendrycks

May 7

Paper: eigenism.org

Eigenism: Ethics for a Human-AI Future

eigenism.org

110

30,919

Maarten Boudry

Dan Hendrycks retweeted

Maarten Boudry

@mboudry

May 2

Are AIs about to subjugate humanity? In the debate about catastrophic AI risk, evolutionary scenarios receive far too little attention compared to largely speculative arguments about “instrumental convergence” (aka “you can’t fetch the coffee if you’re dead”). Yet evolution is a known mechanism for producing selfish, power-seeking agents—and it is substrate-neutral, operating in silico just as readily as in carbon. That’s why it’s encouraging to see three heavyweights in evolutionary biology and AI weigh in with an important new paper in PNAS, building on arguments from @hendrycks and my own work with @simonfriederich. I'll write a post about the paper soon! pnas.org/doi/10.1073/pnas.25…

Evolvable AI: Threats of a new major transition in evolution | PNAS

Evolvable AI (eAI), i.e., AI systems whose components, learning rules, and deployment conditions can themselves undergo Darwinian evolution, may so...

pnas.org

6,398

Richard Ren

Dan Hendrycks retweeted

Richard Ren

@notRichardRen

Apr 28

When an LLM acts happy (“EUREKA!”) or sad (“I have failed…”), is that meaningless mimicry, or does it reflect something “real”? We don’t know if LLMs are conscious. But they increasingly seem to exhibit wellbeing, pain, and pleasure as they get smarter Paper 🧵:

430

112,368

Center for AI Safety

Dan Hendrycks retweeted

Center for AI Safety

@CAIS

Apr 28

207

32,601

Dan Hendrycks

Dan Hendrycks retweeted

Dan Hendrycks

@hendrycks

20 Mar 2024

People aren't thinking through the implications of the military controlling AI development. It's plausible AI companies won't be shaping AI development in a few years, and that would dramatically change AI risk management. Possible trigger: AI might suddenly become viewed as the top priority for national security. This perspective shift could happen when AIs gain the capability of hacking critical infrastructure (~a few years). In this case, the military would want exclusive access to the most power AI systems. Defense Production Act, budget, data: The US military could compel AI organizations to make their AIs for them. It also could demand that NVIDIA's next GPUs go to their chosen organization. The military also has an enormous budget and could pay hundreds of billions for a GPU cluster. They also can get more training data from the NSA and many companies like Google. Military systems are more hazardous: Military systems are sanctioned to hack and use lethal force, so they will have capabilities that others will not. Moreover, some will explicitly be given ruthless propensities. In an anarchic international system, the main objective of states is to compete for power for self-preservation, according to neorealists. Later-stage AIs could be permitted to be explicitly power-seeking, deceptive, and so on since these propensities make the systems more competitive. Futility of AI weapon red lines: Some are trying to create "red lines" that would trigger a pause on AI development. These hoped-for "red lines" often relate to weaponization capabilities such as "when is an AI able to create a novel zero-day?" This red line strategy seems to assume no military involvement. Many of these red lines are actually progress indicators or checkpoints for a military and would not trigger a pause in AI development. Regulation: When militaries get involved, competitive pressures become more intense. Racing dynamics can't be mitigated with corporate liability laws or various forms of regulation as they don't apply to the military. For example, The EU AI Act and White House Executive Order do not apply to the military. Militaries racing could result in a classic security dilemma, like with nuclear weapons. Much of the playbook for "making AI go well" is impotent. This is not to suggest that I am against the military. I'm pointing out that everyone is acting as though corporations will forever be allowed to autonomously develop what will become the powerful technology ever.

435

71,347

Center for AI Safety

Dan Hendrycks retweeted

Center for AI Safety

@CAIS

Feb 5

AI agents are getting good at coding, but how close are they to automating all digital labor? New Remote Labor Index results: Opus 4.5 is able to automate 3.75% of remote labor projects, with GPT-5.2 in second place.

406

112,127

Center for AI Safety

Dan Hendrycks retweeted

Center for AI Safety

@CAIS

Jan 28

Humanity's Last Exam is now published in Nature. Since its release, HLE has become a leading frontier benchmark, used by OpenAI, Anthropic, DeepMind, and xAI. Thank you to our partners at @scale_AI and the 1,000 co-authors who made this benchmark possible.

8,040

Dan Hendrycks

Dan Hendrycks retweeted

Dan Hendrycks

@hendrycks

25 Dec 2025

Replying to @ValerioCapraro

1. “Sensory and social information vs Textual input” Many modern AIs are multimodal, not just textual, and can receive textual, visual, and auditory inputs. These can encode "social information." 2. “Perceptual and situational parsing vs Tokenization and preprocessing” AIs can definitely perceive things and parse context. We can very easily say the retina does massive "preprocessing" before sending information through the optic nerve. 3. “Memory, intuitions, and learned concepts vs Pattern recognition in embeddings” AIs definitely have memory of various learned concepts, and they can "intuit" the answer to commonsense questions (e.g., temporal commonsense, intuitive physics in textually described scenarios, and so on). “Pattern recognition” sounds more detached than “learned concepts,” but “pattern” is a very abstract word and can cover anything worth learning, remembering, or having intuitions about. 4. “Emotions, motivations, goals vs. Statistical inference via neural layers” It is not clear what “statistical inference via neural layers” means. Deep networks don’t have much to do with usual statistical concepts (e.g., t-tests, MCMC, RCTs). Separately, AIs increasingly have value systems, have self-preservation tendencies, say things they otherwise act as though it is false to accomplish tasks, and so on. x.com/DanHendrycks/status/18… idais.ai/dialogue/idais-shan… There's a literature on this. 5. “Reasoning, information integration vs. Textual context integration” AIs can solve various problems that require inductive or deductive reasoning (arxiv.org/pdf/2007.08124). If this is making a distinction between “information” and “textual,” recall that AIs can process many types of information (visual and auditory, not just textual). 6. “Meta-cognition and error-monitoring vs. Forced confidence and hallucination” AIs can assign calibrated probabilities to their statements. arxiv.org/abs/2207.05221 They can be more calibrated than people on various questions. They can also correct their mistakes (very common when they're solving mathematics problems). “Hallucinations” is a popular term that should have been called “confabulation.” Confabulation is something both AIs and humans do. AIs confabulate more, but there is solid progress on reducing this rate each year. 7. “Value-sensitive judgment vs. Probabilistic judgment” It’s unclear what this is pointing at. AIs can handle normative claims, not just descriptive claims. AIs can be sensitive to various normative factors arxiv.org/pdf/2008.02275 and can answer common sense morality questions (“Is it wrong to burn children just for the fun of it?”) and answer more complicated value-sensitive questions such as tort or criminal law questions. --- I gave Gemini 3 a screenshot of your human judgment column, excluding the LLM judgment one, and asked it generate an LLM judgment one: "Recreate the diagram with a new column added: LLM judgment. Use deflationary terms in the second column to make humans seem more special and AIs seem flawed (be brief)." Gemini generated the following, which suggests it's easy to just use deflationary language to make it seem like important distinctions are being drawn.

Dan Hendrycks

@hendrycks

11 Feb 2025

Replying to @hendrycks

Whether we like it or not, AIs are developing their own values. Fortunately, Utility Engineering potentially provides the first major empirical foothold to study misaligned value systems directly. Website: emergent-values.ai Paper: drive.google.com/file/d/1QAz…

10,841

Dan Hendrycks

Dan Hendrycks

@hendrycks

1 Dec 2025

I've been saying mechanistic interpretability is misguided from the start. Glad people are coming around many years later.

Neel Nanda

@NeelNanda5

1 Dec 2025

The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit

381

103,866

Dan Hendrycks

Dan Hendrycks

@hendrycks

1 Dec 2025

I'm also thankful to @NeelNanda5 for writing this. Usually people just quietly pivot.

5,380

Dan Hendrycks

Dan Hendrycks

@hendrycks

1 Dec 2025

ai-frontiers.org/articles/th…

The Misguided Quest for Mechanistic AI Interpretability | AI Frontiers

Dan Hendrycks, May 15, 2025 — Despite years of effort, mechanistic interpretability has failed to provide insight into AI behavior — the result of a flawed foundational assumption.

ai-frontiers.org

5,962