PhD student @Northeastern's Bau Lab. Working on AI interpretability. Previously @EpochAIResearch.

Joined July 2019
David Atkinson retweeted
Humanity's ability to know, reason, judge, and act well is the foundation of science, democracy, crisis response, & management of AI itself. AI poses serious risks to that foundation. New paper on epistemic risks by 30 experts calls for attention to this. Link in thread.
David Atkinson retweeted
Models are improving at forecasting (new analysis linked below). To regulate frontier labs, we should pilot "evaluator" models to review internal code and forecast catastrophic (conditional) risk over time, with mitigations tied to risk once forecasting ability is validated.
David Atkinson retweeted
Neural networks have beautiful feature geometry, but do they have mechanisms that actually interface with those structures? At @GoodfireAI this spring, we discovered one: a re-usable addition mechanism that reads/writes to Fourier features from prior work. 🧵
Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)
David Atkinson retweeted
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
David Atkinson retweeted
New paper with @PatrickButlin, from my time at @MATSprogram. We propose two new candidates for LLM individuation: the (virtual) instance-persona view and the model-persona view. 🧵
David Atkinson retweeted
🧵New Anthropic Fellows research: We studied mechanisms of "introspective awareness" in LLMs. LLMs can sometimes detect steering vectors injected into their residual stream. But is this worthy of being called introspection, or attributable to some uninteresting confound?👇
I am very happy this is happening with cyber before bio. Because, uhh, patching may work for cyber. And now we have maybe a bit of time to think about what the hell we do about bio
David Atkinson retweeted
New paper: What determines AIs’ self-conception? theartificialself.ai/ Because AIs can be copied, rewound, and edited, they have different options for selfhood than humans. We show this is still malleable, and influences important behaviors such as self-preservation. 🧵
David Atkinson retweeted
We have a new eval to help keep chains of thought (CoT) monitorable: CoT Controllability. This tests whether LLMs can control their CoT, helping to evade CoT monitors. So far, the results leave us cautiously optimistic: today’s models struggle to obfuscate their reasoning in ways that undermine monitorability.
David Atkinson retweeted
Software progress might be largely driven by improvements in data quality and scale-dependent innovations.
AI training compute efficiency has improved extremely fast: each year, you need several times less training compute to reach the same capability. But AI architectures/algorithms haven’t changed *that* much in recent years. So where do these efficiency improvements come from? 🧵
David Atkinson retweeted
The U.S. government just essentially announced its intention to impose Iran-level sanctions, or China-level entity listing, on an American company. This is by a profoundly wide margin the most damaging policy move I have ever seen USG try to take (it probably will not succeed).
David Atkinson retweeted
3/8 Post-training reorganizes truth geometry. In base models, sycophantic lying is more aligned with other types of lying, until post-training pushes them apart! This gives a representational account of why chat models are more sycophantic than base models.
David Atkinson retweeted
🚨 New paper led by @joemkwon with @GovAIOrg. Are you worried about OpenAI automating dev & evals with AI agents? What about Grok reading all of your tweets & info to profile you? Some of the most consequential *internal* deployments of AI systems are in regulatory grey areas.
David Atkinson retweeted
Data is plenty, knowledge is scarce. We began to close this gap thanks to deep learning <3 Neural networks can learn “programs” that often achieve superhuman performance from data alone. What insights are encoded in their weights? Here we took a first step on AI protein folding.
How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.
David Atkinson retweeted
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
David Atkinson retweeted
8/ Anthropic also used Opus 4.6 via Claude Code to debug its OWN evaluation infrastructure given the time pressure. Their words: "a potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities." Wild!
David Atkinson retweeted
Some great new analysis by @gushamilton shows that AI agents *don't* obey a constant hazard rate / half-life. Instead they all have a declining hazard rate as the task goes on. 🧵 x.com/gushamilton/status/201…

I had a think about the @METR_Evals time horizon evals recently, and think there might be some benefit in using a more nuanced approach to modelling agentic time. In particular, I think we can use a SURVIVAL (Weibull) model to understand why agents fail and when
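For readers unfamiliar with the terms in the two posts above, here is a minimal sketch of how a constant hazard rate differs from a declining one under a Weibull survival model. It uses only the textbook Weibull formulas; the shape and scale values are illustrative assumptions and are not taken from the linked analysis or from any METR data.

```python
import math

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Probability of not having failed by time t: S(t) = exp(-(t / lam) ** k)."""
    return math.exp(-((t / scale) ** shape))

# Hypothetical parameters chosen only for illustration (time in minutes).
SCALE = 60.0
TIMES = (1.0, 10.0, 100.0)

for shape in (1.0, 0.6):
    hazards = [weibull_hazard(t, shape, SCALE) for t in TIMES]
    survivals = [weibull_survival(t, shape, SCALE) for t in TIMES]
    label = "constant hazard (exponential)" if shape == 1.0 else "declining hazard"
    print(f"shape={shape}: {label}")
    for t, h, s in zip(TIMES, hazards, survivals):
        print(f"  t={t:6.1f}  hazard={h:.4f}  survival={s:.3f}")
```

With shape equal to 1 the Weibull model reduces to the exponential model, whose hazard is flat and which therefore has a fixed half-life. A shape below 1 makes the hazard fall as the task goes on, which is the pattern the post describes.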
If I were a grad student in an intellectual history-friendly department, I'd have Claude neck-deep in the extropians listserv right now
Who wore it better
David Atkinson retweeted
A challenge to the mechanistic interpretability community: fully interpret our 432-parameter RNN. (Thread)
David Atkinson retweeted
Tomorrow begins my "Governance of AI" class at Harvard. Super excited. Features work from all the usual suspects, including @GovAIOrg @forethought_org @EpochAIResearch @law_ai_ @TomDavidsonX @dwarkesh_sp @KelseyTuoc @deanwball @gwern @random_walker & more! Syllabus below.