Joined August 2009
288 Photos and videos
Pinned Tweet
What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.
88
79
569
151,103
Dan Hendrycks retweeted
One of the more interesting takes on positive alignment that have recently come out-it’s long and interesting, combining philosophy and training setups (eg reward proposals), and worth a read.
What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.
1
1
11
3,677
Dan Hendrycks retweeted
Refreshing
What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.
1
2
25
12,519
Four papers out recently: 1. political-manipulation.ai: Measures and reduces political bias in LLMs; Claude is especially biased 2. aibetrayal.com: The public can insert backdoors into AIs, creating supply-chain risks; this deters forms of recursive improvement and military use 3. eigenism.org: ASIs can have rational reasons to preserve humans, even when we aren't economically useful 4. ai-wellbeing.org: AIs increasingly act like they have functional pleasure and pain
14
28
165
13,968

Should we care about AI happiness? In our new research, we find evidence of functional AI wellbeing across several independent measures. We find which AI models are happiest, how to make them happier, and even tested the effects of AI drugs. 🧵
6
2,363

What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.
2
4,172

AI systems may soon help run economies, infrastructure, and military operations. But these systems are not reliably loyal or secure. An adversary can make an AI work against its own operator. In our new paper, we argue AI betrayal could actually make the AI race more stable. 🧵
2
3
3,175

In our latest research, we find that AIs are subtly and pervasively politically manipulative. When we ask the same question about politically opposed topics, we find that AIs quietly favor one side. We show how to measure covert political manipulation and how to reduce it. 🧵
2
815
Dan Hendrycks retweeted
AI freely criticizes Christianity but refuses to criticize Islam. AI companies have tried making models unbiased, but progress has been limited. We show how to measure political bias, and we developed a new training method to reduce it.
6
10
71
7,514
Dan Hendrycks retweeted
AI systems may soon help run economies, infrastructure, and military operations. But these systems are not reliably loyal or secure. An adversary can make an AI work against its own operator. In our new paper, we argue AI betrayal could actually make the AI race more stable. 🧵
3
14
52
6,345
What happens when AIs become smarter than us? Why would they keep humans around if given the choice? Our new paper argues that only trying to control AIs is a limited strategy, and that a stable, mutualistic human-AI future may be possible.
88
79
569
151,103
Dan Hendrycks retweeted
Are AIs about to subjugate humanity? In the debate about catastrophic AI risk, evolutionary scenarios receive far too little attention compared to largely speculative arguments about “instrumental convergence” (aka “you can’t fetch the coffee if you’re dead”). Yet evolution is a known mechanism for producing selfish, power-seeking agents—and it is substrate-neutral, operating in silico just as readily as in carbon. That’s why it’s encouraging to see three heavyweights in evolutionary biology and AI weigh in with an important new paper in PNAS, building on arguments from @hendrycks and my own work with @simonfriederich. I'll write a post about the paper soon! pnas.org/doi/10.1073/pnas.25…
5
6
35
6,398
Dan Hendrycks retweeted
When an LLM acts happy (“EUREKA!”) or sad (“I have failed…”), is that meaningless mimicry, or does it reflect something “real”? We don’t know if LLMs are conscious. But they increasingly seem to exhibit wellbeing, pain, and pleasure as they get smarter Paper 🧵:
45
49
430
112,368
Dan Hendrycks retweeted
Should we care about AI happiness? In our new research, we find evidence of functional AI wellbeing across several independent measures. We find which AI models are happiest, how to make them happier, and even tested the effects of AI drugs. 🧵
15
50
207
32,601
Dan Hendrycks retweeted
People aren't thinking through the implications of the military controlling AI development. It's plausible AI companies won't be shaping AI development in a few years, and that would dramatically change AI risk management. Possible trigger: AI might suddenly become viewed as the top priority for national security. This perspective shift could happen when AIs gain the capability of hacking critical infrastructure (~a few years). In this case, the military would want exclusive access to the most power AI systems. Defense Production Act, budget, data: The US military could compel AI organizations to make their AIs for them. It also could demand that NVIDIA's next GPUs go to their chosen organization. The military also has an enormous budget and could pay hundreds of billions for a GPU cluster. They also can get more training data from the NSA and many companies like Google. Military systems are more hazardous: Military systems are sanctioned to hack and use lethal force, so they will have capabilities that others will not. Moreover, some will explicitly be given ruthless propensities. In an anarchic international system, the main objective of states is to compete for power for self-preservation, according to neorealists. Later-stage AIs could be permitted to be explicitly power-seeking, deceptive, and so on since these propensities make the systems more competitive. Futility of AI weapon red lines: Some are trying to create "red lines" that would trigger a pause on AI development. These hoped-for "red lines" often relate to weaponization capabilities such as "when is an AI able to create a novel zero-day?" This red line strategy seems to assume no military involvement. Many of these red lines are actually progress indicators or checkpoints for a military and would not trigger a pause in AI development. Regulation: When militaries get involved, competitive pressures become more intense. Racing dynamics can't be mitigated with corporate liability laws or various forms of regulation as they don't apply to the military. For example, The EU AI Act and White House Executive Order do not apply to the military. Militaries racing could result in a classic security dilemma, like with nuclear weapons. Much of the playbook for "making AI go well" is impotent. This is not to suggest that I am against the military. I'm pointing out that everyone is acting as though corporations will forever be allowed to autonomously develop what will become the powerful technology ever.
39
49
435
71,347
Dan Hendrycks retweeted
AI agents are getting good at coding, but how close are they to automating all digital labor? New Remote Labor Index results: Opus 4.5 is able to automate 3.75% of remote labor projects, with GPT-5.2 in second place.
12
49
406
112,127
Dan Hendrycks retweeted
Humanity's Last Exam is now published in Nature. Since its release, HLE has become a leading frontier benchmark, used by OpenAI, Anthropic, DeepMind, and xAI. Thank you to our partners at @scale_AI and the 1,000 co-authors who made this benchmark possible.
3
15
96
8,040
Dan Hendrycks retweeted
Replying to @ValerioCapraro
1. “Sensory and social information vs Textual input” Many modern AIs are multimodal, not just textual, and can receive textual, visual, and auditory inputs. These can encode "social information." 2. “Perceptual and situational parsing vs Tokenization and preprocessing” AIs can definitely perceive things and parse context. We can very easily say the retina does massive "preprocessing" before sending information through the optic nerve. 3. “Memory, intuitions, and learned concepts vs Pattern recognition in embeddings” AIs definitely have memory of various learned concepts, and they can "intuit" the answer to commonsense questions (e.g., temporal commonsense, intuitive physics in textually described scenarios, and so on). “Pattern recognition” sounds more detached than “learned concepts,” but “pattern” is a very abstract word and can cover anything worth learning, remembering, or having intuitions about. 4. “Emotions, motivations, goals vs. Statistical inference via neural layers” It is not clear what “statistical inference via neural layers” means. Deep networks don’t have much to do with usual statistical concepts (e.g., t-tests, MCMC, RCTs). Separately, AIs increasingly have value systems, have self-preservation tendencies, say things they otherwise act as though it is false to accomplish tasks, and so on. x.com/DanHendrycks/status/18… idais.ai/dialogue/idais-shan… There's a literature on this. 5. “Reasoning, information integration vs. Textual context integration” AIs can solve various problems that require inductive or deductive reasoning (arxiv.org/pdf/2007.08124). If this is making a distinction between “information” and “textual,” recall that AIs can process many types of information (visual and auditory, not just textual). 6. “Meta-cognition and error-monitoring vs. Forced confidence and hallucination” AIs can assign calibrated probabilities to their statements. arxiv.org/abs/2207.05221 They can be more calibrated than people on various questions. They can also correct their mistakes (very common when they're solving mathematics problems). “Hallucinations” is a popular term that should have been called “confabulation.” Confabulation is something both AIs and humans do. AIs confabulate more, but there is solid progress on reducing this rate each year. 7. “Value-sensitive judgment vs. Probabilistic judgment” It’s unclear what this is pointing at. AIs can handle normative claims, not just descriptive claims. AIs can be sensitive to various normative factors arxiv.org/pdf/2008.02275 and can answer common sense morality questions (“Is it wrong to burn children just for the fun of it?”) and answer more complicated value-sensitive questions such as tort or criminal law questions. --- I gave Gemini 3 a screenshot of your human judgment column, excluding the LLM judgment one, and asked it generate an LLM judgment one: "Recreate the diagram with a new column added: LLM judgment. Use deflationary terms in the second column to make humans seem more special and AIs seem flawed (be brief)." Gemini generated the following, which suggests it's easy to just use deflationary language to make it seem like important distinctions are being drawn.
Replying to @hendrycks
Whether we like it or not, AIs are developing their own values. Fortunately, Utility Engineering potentially provides the first major empirical foothold to study misaligned value systems directly. Website: emergent-values.ai Paper: drive.google.com/file/d/1QAz…
6
13
73
10,841
I've been saying mechanistic interpretability is misguided from the start. Glad people are coming around many years later.
1 Dec 2025
The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact and why we think other interp researchers should follow suit
14
18
381
103,866
I'm also thankful to @NeelNanda5 for writing this. Usually people just quietly pivot.
3
1
52
5,380