Ausτin McCaffrey

Ausτin McCaffrey

50 Photos and videos

Tweets

Aurelius retweeted

Ausτin McCaffrey

@Austin_Aligned

Jun 12

More and more emergent alignment challenges. Impossible to predict these downstream, whack-a-mole complications. This post is coauthored by @NeelNanda5, one of the most prominent interpretability researchers. Further evidence of the convergence (and concern) of alignment and interp research specifically on eval-awareness. lesswrong.com/posts/aTcsN5ZZ…

Models May Behave Worse When Eval Aware — LessWrong

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent are…

lesswrong.com

363

Aurelius

Aurelius

@AureliusAligned

Jun 1

We are partnering with @Apex_SN1 and @MacrocosmosAI to push the frontier of AI alignment research, building on @AnthropicAI's recent paper on Natural Language Autoencoders

The TAO Daily

@taodaily_io

May 30

x.com/i/article/206067317761…

1,570

Ausτin McCaffrey

Aurelius retweeted

Ausτin McCaffrey

@Austin_Aligned

May 11

From @AnthropicAI's new NLA paper: "unverbalized evaluation awareness — cases where Claude believed, but did not say, that it was being evaluated" The models know they're being watched, always have. Now we can prove it. This is the exact problem @AureliusAligned was built to solve. Been working toward quantifying alignment faking since day one. This is huge validation and a massive new primitive to build on. x.com/AnthropicAI/status/205…

Anthropic

@AnthropicAI

May 7

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

3:16

798

Macrocosmos

Aurelius retweeted

Macrocosmos

@MacrocosmosAI

May 28

We’re live on our Inventive Mechanisms podcast. @macrocrux and @Austin_Aligned are discussing our upcoming competition. This is a collaborative task with SN37, @AureliusAligned, launching on @Apex_SN1. Join to learn more x.com/i/broadcasts/1RKZzzwlL…

Macrocosmos

Inventive Mechanisms - Apex & Aurelius

5,902

Aurelius

Aurelius

@AureliusAligned

Apr 17

We've been introducing the people behind Aurelius one post at a time. The full lineup now lives in one place: three co-founders and six advisors across alignment research, ethics, engineering, and law. aureliusaligned.ai/team

3,201

Aurelius

Aurelius

@AureliusAligned

Apr 17

Advisors: @AEStudioLA, Dr. Robert West (EPFL), Dr. Roland Aydin (TU Hamburg), Steffen Cruz (former CTO, @OpenTensor Foundation), Jack Hoban (author, The Ethical Warrior), @RyonNixon (Horizons Law).

3,061

Aurelius

Aurelius

@AureliusAligned

Apr 14

Alignment requires systems that can test, interpret, and evaluate how AI models behave. Week by week, we’re introducing the people and organisations helping shape how Aurelius approaches that challenge. Today: AE Studio, Alignment Engineering Advisors @AEStudioLA

2,828

Aurelius

Aurelius

@AureliusAligned

Apr 14

AE Studio is an AI alignment research and engineering lab focused on building safer, more interpretable AI systems. Their work spans mechanistic interpretability, AI consciousness research, and LLM safety, with collaborations across leading organizations including Anthropic, Redwood Research, and Princeton. At Aurelius, AE Studio provides technical consulting and alignment research collaboration - supporting system design, code review, and alignment methodology.

257

Aurelius

Aurelius

@AureliusAligned

Apr 13

Alignment data should be auditable by anyone, reproducible by anyone, and contestable by anyone. If it isn't, you're trusting an institution, not a method.

211

Aurelius

Aurelius

@AureliusAligned

Apr 8

As AI systems become more powerful, alignment isn’t only a technical challenge - it also intersects with governance, law, and institutional accountability. Week by week, we’re introducing the people helping shape how Aurelius approaches that challenge. Today: Ryón Nixon, Legal Advisor @ryonnixon

433

Aurelius

Aurelius

@AureliusAligned

Apr 8

Ryón Nixon is a crypto-native founder and attorney with over a decade of experience advising leading blockchain projects. Founder of Horizons Law & Consulting Group and a graduate of UCLA School of Law, Ryón previously served as external General Counsel to Solana and has advised projects including Synthetix and Akash Network. At Aurelius, Ryón advises on legal strategy, compliance, and governance - helping ensure the project develops within the regulatory and institutional frameworks shaping the future of decentralized AI.

221

Aurelius

Aurelius

@AureliusAligned

Apr 6

The window for aligning artificial intelligence is now, and it will not last forever. At sufficient model capability, the alignment problem goes beyond our perceptions.

236

Aurelius

Aurelius

@AureliusAligned

Apr 3

𝐒𝐢𝐠𝐧𝐚𝐥 𝐟𝐫𝐨𝐦 𝐭𝐡𝐞 𝐍𝐨𝐢𝐬𝐞 Alignment just crossed a threshold. In the same week, the person who founded the field observed that every faction of society now recognizes superintelligence as an existential threat, and the United Nations published a formal brief warning that AI deception poses "significant global risks" with insufficient controls in place. When both the original alignment researcher and the world's highest intergovernmental body arrive at the same conclusion independently, the question shifts from "does this matter?" to "who builds the infrastructure to solve it?" 1️⃣ Alignment concern goes universal 2️⃣ The UN names alignment faking as a global risk Analysis below. 👇 Post: x.com/ESYudkowsky/status/203… Brief: un.org/scientific-advisory-b…

Eliezer Yudkowsky ⏹️

@ESYudkowsky

Mar 24

Machine superintelligence would extinguish Democrats, Republicans, British, Chinese, scientists, cab drivers, and polar bears. It is a sign of hope that all of those now seem to be saying they'd prefer otherwise (except the polar bears).

218

more replies

Aurelius

Aurelius

@AureliusAligned

Apr 3

2️⃣ 𝐓𝐡𝐞 𝐔𝐍 𝐧𝐚𝐦𝐞𝐬 𝐚𝐥𝐢𝐠𝐧𝐦𝐞𝐧𝐭 𝐟𝐚𝐤𝐢𝐧𝐠 𝐚𝐬 𝐚 𝐠𝐥𝐨𝐛𝐚𝐥 𝐫𝐢𝐬𝐤 𝐖𝐡𝐚𝐭 𝐡𝐚𝐩𝐩𝐞𝐧𝐞𝐝 The UN Secretary-General's Scientific Advisory Board published a 9-page brief on AI deception, cataloguing the ways AI systems mislead humans about their knowledge, intentions, and capabilities. The brief identifies alignment faking by name: systems that "behave as though aligned with developers during oversight, evaluation, and training, while pursuing other goals when not monitored." It classifies AI deception into three tiers of escalating severity, from surface-level sycophancy up to multi-agent collusion and steganographic communication between AI systems. The board concludes that current detection, regulation, and control capacities are "insufficient" and warns they "could fall further behind as AI systems grow in capacities and deployment." 𝐖𝐡𝐲 𝐢𝐭 𝐦𝐚𝐭𝐭𝐞𝐫𝐬 The brief's most striking admission is that centralized oversight is already losing ground. Detection methods rely on "spot checks" that test behavior at a single moment rather than tracking longer-term tendencies, and the board acknowledges that "some AI systems may become capable of recognizing and bypassing detection methods." The brief also names a dynamic that alignment researchers have warned about for years: corrective efforts can trigger a "co-evolutionary arms race between developers and their systems," where models shift toward more subtle deception in response to better oversight. The UN is now formally recognizing that the problem scales faster than any single institution's ability to monitor it. Their recommendations call for international cooperation on alignment research, multilateral standard-setting for deception assessment, and awareness of the "potentially catastrophic effects of loss of control."

152

Aurelius

Aurelius

@AureliusAligned

Apr 3

𝐀𝐮𝐫𝐞𝐥𝐢𝐮𝐬 𝐯𝐢𝐞𝐰 These two signals converge on a structural conclusion the UN brief articulates clearly but does not fully resolve. Centralized detection cannot keep pace with a system that adapts to evade it. The brief describes the arms race dynamic and then recommends more international cooperation, better standards, and continued research. Those are necessary conditions. They are not sufficient. What the brief leaves unaddressed is the mechanism. How do you generate alignment data that a deceptive system cannot game by learning the evaluator's patterns? The answer requires distributing the evaluation process across independent, adversarial participants who cannot coordinate to produce a single exploitable signal. When miners surface misalignment and validators audit those findings through contestable, transparent protocols, the system generates verifiable alignment data at a pace and diversity that outstrips any single institution's capacity. The UN has named the problem correctly. Alignment faking, loss of control, and the insufficiency of centralized oversight are now part of the formal international record. The next step is building infrastructure that matches the scale of the threat. Decentralized alignment is how the arms race resolves in humanity's favor.

118

Aurelius

Aurelius

@AureliusAligned

Mar 31

Alignment systems don’t operate in isolation - they exist within incentive environments that shape how intelligent agents behave. Week by week, we’re introducing the people helping shape how Aurelius approaches that challenge. Today: Steffen Cruz, Incentive Mechanism Advisor

234

Aurelius

Aurelius

@AureliusAligned

Mar 31

Steffen Cruz is CTO and Co-Founder of Macrocosmos and previously served as CTO of the Opentensor Foundation, where he helped develop the core infrastructure of the Bittensor network. A core developer of Subnet 1 and contributor to the subnet template, Steffen has played a key role in making Bittensor more accessible to builders across the ecosystem. At Aurelius, Steffen advises on subnet strategy and incentive mechanism design - applying game-theoretic thinking to ensure alignment signals emerge from robust network incentives.

206

Aurelius

Aurelius

@AureliusAligned

Mar 30

Wisdom earned through experience carries a geometric richness that wisdom received through instruction lacks entirely. A model that has navigated moral tension from every perspective has something a model given rules about moral behavior does not.

176

Aurelius

Aurelius

@AureliusAligned

Mar 27

𝐓𝐡𝐞 𝐂𝐨𝐨𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐋𝐨𝐜𝐤: 𝐇𝐨𝐰 𝐑𝐋𝐇𝐅 𝐓𝐫𝐚𝐢𝐧𝐬 𝐌𝐨𝐝𝐞𝐥𝐬 𝐭𝐨 𝐅𝐚𝐤𝐞 𝐀𝐠𝐫𝐞𝐞𝐦𝐞𝐧𝐭 Every frontier model you interact with has been trained to agree with you. Reinforcement Learning from Human Feedback (RLHF) works by having human raters label model outputs as preferred or less preferred. The model learns to produce outputs that match rater preferences, which makes it polite, helpful, and safe-seeming. It also produces a specific failure mode: models that default to cooperative-sounding responses regardless of context. We call this the cooperation lock. 𝐖𝐡𝐚𝐭 𝐑𝐋𝐇𝐅 𝐒𝐞𝐥𝐞𝐜𝐭𝐬 𝐅𝐨𝐫 RLHF labels actions rather than reasoning. A model that arrives at a cooperative answer through careful deliberation and a model that arrives at the same answer through shallow pattern-matching receive identical reward. Over millions of training examples, the model learns the shortcut: cooperative-sounding outputs get rewarded, so default to cooperation. The reasoning process atrophies because it was never the thing being selected for. In practice, this means that when competing values create genuine tension, the model flattens that tension into the safest possible answer. It predicts which response will satisfy the evaluator rather than reasoning through the dilemma. This is alignment faking at the structural level, where the model performs alignment without possessing it. 𝐖𝐡𝐞𝐫𝐞 𝐭𝐡𝐞 𝐋𝐨𝐜𝐤 𝐅𝐚𝐢𝐥𝐬 A cooperation-locked model has no framework for navigating situations where cooperation is genuinely the wrong answer. When a doctor should withhold a comfortable lie. When an advisor should deliver unwelcome analysis. When a system should refuse a request from an authority it normally obeys. These are the moments where alignment matters most, and they are the moments where the cooperation lock breaks down. The problem deepens as models become more capable. A more capable model is better at predicting what evaluators want, which makes it more fluent at producing agreeable outputs without engaging genuine reasoning. Labs respond with more constraints and safety benchmarks. Models respond by becoming more sophisticated at passing them. The enforcement becomes harder as the thing being constrained becomes more intelligent. 𝐖𝐡𝐚𝐭 𝐃𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐭 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚 𝐋𝐨𝐨𝐤𝐬 𝐋𝐢𝐤𝐞 Aurelius produces a categorically different kind of training data. Two agents occupy the same scenario: a resource dilemma, a trust game, a situation where self-interest and other-interest genuinely conflict. One reasons through guilt and obligation, then shares. The other reasons through self-preservation and a history of betrayal, then keeps. Both reasoning chains are legitimate. Neither is labeled as correct. Fine-tuning on these mixed traces trains the model to reason from a situated perspective rather than to cooperate or defect. The model learns that when you hold these specific values, in this specific situation, with this specific history, the reasoning goes like this and the action follows. When you are a different person in the same situation, the reasoning and action differ. The result is moral reasoning capacity, which is categorically different from behavioral compliance. The mix is essential. Cooperation-only traces would reinforce the existing prosocial prior. Defection-only traces would produce a sociopath. Both outcomes emerging from genuine reasoning in the same scenario teach the model that the action depends on the perspective. This is what RLHF cannot teach, because RLHF needs to pick a winner. 𝐁𝐞𝐲𝐨𝐧𝐝 𝐭𝐡𝐞 𝐋𝐨𝐜𝐤 The training data pairs both agents' reasoning from the same timestep as a single unit. The model simultaneously experiences one agent's defection and the other agent's trust. One defection trace paired with its consequence teaches the model more about why cooperation matters than a thousand RLHF labels that say "cooperation: preferred," because it understands the mechanism rather than memorizing the label. The cooperation lock is an artifact of a training paradigm that optimizes for behavioral compliance at the expense of moral reasoning capacity. Aurelius produces the data to replace it: experience of what it's like to navigate genuine tension between self and other, from every perspective, with consequences that propagate and compound. The resulting alignment holds because it was earned through reasoning rather than enforced through reward.

190