Jordan Taylor

Tweets

Pinned Tweet

Jordan Taylor @JordanTensor

17 May 2024

I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram: 🧵1/8

Lee Sharkey

@leedsharkey

17 May 2024

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ publications.apolloresearch.…

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Our SAEs explain significantly more performance than before! 1/

Jordan Taylor retweeted

Zvi Mowshowitz

@TheZvi

Jun 10

UK AISI is doing some great jailbreaking work. They seem to consistently be able to get through, where others don't.

Jordan Taylor retweeted

Dewi Gould @dswg97

Jun 10

New paper! Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models @METR_Evals showed that models' time horizons have doubled every few months. We ask: what length of tasks can models complete without any CoT?

Jordan Taylor retweeted

Rauno Arike @RaunoArike

Jun 10

Glad to have contributed to this new paper! We measured the length of tasks LLMs can complete without CoT, which is a key proxy for the extent to which we can trust CoT monitors. Result: the 50% no-CoT time horizons of frontier LLMs are ~3 minutes and double every 373 days.

Dewi Gould @dswg97

Jun 10

Jordan Taylor retweeted

Geoffrey Irving

@geoffreyirving

Jun 10

We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵

Jordan Taylor retweeted

Joseph Bloom

@JBloomAus

Jun 9

Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵

Jordan Taylor retweeted

Anthropic

@AnthropicAI

Jun 4

Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention. anthropic.com/institute/recu…

When AI builds itself

Our progress toward recursive self-improvement, and its implications.

anthropic.com

Jordan Taylor retweeted

Buck Shlegeris @bshlgrs

May 28

An obvious way to study whether a training technique removes misalignment is to run that technique on a model organism (MO). But we've found that MOs are often weirdly fragile. E.g. training them to talk like a pirate often removes their bad behavior. 1/2

Jordan Taylor retweeted

Joseph Bloom

@JBloomAus

May 21

This report is an incredibly detailed and broad look into how it might become harder to monitor, audit or generally make confident claims about frontier AI systems. We interviewed an exceptional array of experts from multiple frontier labs, academia and industry. Worth a read!

AI Security Institute

@AISecurityInst

May 21

The safety of advanced AI systems increasingly depends on the ability to oversee them. Our new report examines today’s AI oversight landscape, finding many pathways likely to lead to its degradation.🧵

Jordan Taylor retweeted

Thomas Read @thjread

May 21

I helped write this report on oversight of AI systems and how it could degrade - it's a great overview, and a good guide to what research directions might help us maintain the level of oversight we enjoy today

AI Security Institute

@AISecurityInst

May 21

Jordan Taylor @JordanTensor

May 21

There are a lot of pathways via which AI oversight is likely to degrade! Latent reasoning architectures, situational awareness, representational drift... We wrote a report ranking them. Here I'll go into some which worry me most 🧵

AI Security Institute

@AISecurityInst

May 21

Jordan Taylor @JordanTensor

May 21

See more in the paper: aisi.gov.uk/blog/will-it-bec…

Will it become harder to oversee AI systems? | AISI Work

Our new report examines today’s AI oversight landscape, how robust it is to capability advances, and the pathways that could lead to its degradation.

aisi.gov.uk

Jordan Taylor @JordanTensor

May 21

Or engage on the associated LessWrong post: lesswrong.com/posts/JvZxp554…

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate — LessWrong

Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply her…

lesswrong.com