Kellin Pelrine

David Atkinson retweeted

Kellin Pelrine @KellinPelrine

Jun 9

Humanity's ability to know, reason, judge, and act well is the foundation of science, democracy, crisis response, & management of AI itself. AI poses serious risks to that foundation. New paper on epistemic risks by 30 experts calls for attention to this. Link in thread.

David Atkinson retweeted

Jason Abaluck @Jabaluck

Jun 8

Models are improving at forecasting (new analysis linked below). To regulate frontier labs, we should pilot "evaluator" models to review internal code and forecast catastrophic (conditional) risk over time, with mitigations tied to risk once forecasting ability is validated.

David Atkinson retweeted

Sheridan Feucht @sheridan_feucht

May 14

Neural networks have beautiful feature geometry, but do they have mechanisms that actually interface with those structures? At @GoodfireAI this spring, we discovered one: a re-usable addition mechanism that reads/writes to Fourier features from prior work. 🧵

Goodfire

@GoodfireAI

May 14

Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)

David Atkinson retweeted

Lee Sharkey @leedsharkey

May 5

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

David Atkinson retweeted

Pierre Beckmann @BeckmannPierre

Apr 20

New paper with @PatrickButlin, from my time at @MATSprogram. We propose two new candidates for LLM individuation: the (virtual) instance-persona view and the model-persona view. 🧵

David Atkinson retweeted

Uzay Macar @uzaymacar

Apr 14

🧵New Anthropic Fellows research: We studied mechanisms of "introspective awareness" in LLMs. LLMs can sometimes detect steering vectors injected into their residual stream. But is this worthy of being called introspection, or attributable to some uninteresting confound?👇

David Atkinson retweeted

Daniel Eth (yes, Eth is my actual last name) @daniel_271828

Apr 8

I am very happy this is happening with cyber before bio. Because, uhh, patching may work for cyber. And now we have maybe a bit of time to think about what the hell we do about bio

David Atkinson retweeted

Jan Kulveit @jankulveit

Mar 13

New paper: What determines AIs’ self-conception? theartificialself.ai/ Because AIs can be copied, rewound, and edited, they have different options for selfhood than humans. We show this is still malleable, and influences important behaviors such as self-preservation. 🧵

The Artificial Self

AI systems are on track to take on important new roles. We explore how properties bundled for humans can be separated and remixed for machine-based minds.

David Atkinson retweeted

Tomek Korbak @tomekkorbak

Mar 5

We have a new eval to help keep chains of thought (CoT) monitorable: CoT Controllability. This tests whether LLMs can control their CoT, helping to evade CoT monitors. So far, the results leave us cautiously optimistic: today’s models struggle to obfuscate their reasoning in ways that undermine monitorability.

David Atkinson retweeted

Jaime Sevilla @Jsevillamol

Feb 26

Software progress might be largely driven by improvements in data quality and scale-dependent innovations.

Epoch AI

@EpochAIResearch

Feb 26

AI training compute efficiency has improved extremely fast: each year, you need several times less training compute to reach the same capability. But AI architectures/algorithms haven’t changed *that* much in recent years. So where do these efficiency improvements come from? 🧵

David Atkinson retweeted

Dean W. Ball @deanwball

Feb 27

The U.S. government just essentially announced its intention to impose Iran-level sanctions, or China-level entity listing, on an American company. This is by a profoundly wide margin the most damaging policy move I have ever seen USG try to take (it probably will not succeed).

David Atkinson retweeted

Zhuofan Josh Ying @zfjoshying

Feb 25

3/8 Post-training reorganizes truth geometry. In base models, sycophantic lying is more aligned with other types of lying, until post-training pushes them apart! This gives a representational account of why chat models are more sycophantic than base models.

David Atkinson retweeted

Cas (Stephen Casper) @StephenLCasper

Feb 16

🚨 New paper led by @joemkwon with @GovAIOrg Are you worried about OpenAI automating dev & evals with AI agents? What about Grok reading all of your tweets & info to profile you? Some of the most consequential *internal* deployments of AI systems are in regulatory grey areas.

David Atkinson retweeted

Chris Wendler @wendlerch

Feb 10

Data is plenty, knowledge is scarce. We began to close this gap thanks to deep learning <3 Neural networks can learn “programs” that often achieve superhuman performance from data alone. What insights are encoded in their weights? Here we took a first step on AI protein folding.

Kevin Lu

@kevinlu4588

Feb 10

How do protein folding models turn sequence into structure? In "Mechanisms of AI Protein Folding in ESMFold", we find properties like charge and distance encoded in interpretable, steerable directions. The trunk processes features in two phases: chemistry first, then geometry.

David Atkinson retweeted

Subhash Kantamneni @thesubhashk

Feb 6

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

David Atkinson retweeted

Peter Wildeford🇺🇸🚀 @peterwildeford

Feb 5

8/ Anthropic also used Opus 4.6 via Claude Code to debug its OWN evaluation infrastructure given the time pressure. Their words: "a potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities." Wild!

David Atkinson retweeted

Toby Ord @tobyordoxford

Feb 4

Some great new analysis by @gushamilton shows that AI agents *don't* obey a constant hazard rate / half-life. Instead they all have a declining hazard rate as the task goes on. 🧵 x.com/gushamilton/status/201…

Gus Hamilton @gushamilton

Feb 2

I had a think about the @METR_Evals time horizon evals recently, and think there might be some benefit in using a more nuanced approach to modelling agentic time. In particular, I think we can use a SURVIVAL (Weibull) model to understand why agents fail and when
