AI Security Institute

Tweets

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

Jun 12

Can frontier AI help defend government systems? AISI, the Government Cyber Coordination Centre, and @NCSC recently collaborated in a pilot to use frontier AI to strengthen cyber resilience across the UK public sector. You can read the results here: gov.uk/government/case-studi…

When AI Leaves the Lab: Testing Frontier Models in Government Cyber Defence

The Government Cyber Action Plan aims to boost cyber resilience across the UK public sector by using emerging technologies to manage risk. The Government Cyber Coordination Centre (GC3) - a partner...

gov.uk

10,755

Harry Coppock retweeted

Robert Kirk @_robertkirk

Jun 10

Where do evaluation awareness and evaluation gaming come from? 🔬📉 New post from @arbdwj and me tracing evaluation awareness through OLMo 3 training! 🧵

1,280

Harry Coppock retweeted

Geoffrey Irving

@geoffreyirving

Jun 10

We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵

139

946

182,864

Harry Coppock retweeted

Sayash Kapoor @sayashk

Jun 10

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations. Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability. By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

119

1,339

107,968

Harry Coppock retweeted

𝕾𝖎𝖉 @realmeatyhuman

Jun 10

We gave language models access to "drugs" and watched what they did. Specifically: we gave them steering vectors that control their emotional states or mimic psychoactive substances, in the form of tools the model can call to self-steer. 🧵

278

16,170

Harry Coppock retweeted

Kanishka Narayan MP

@KanishkaNarayan

Jun 9

1/ The @AISecurityInst's world-leading testing keeps the UK at the forefront of understanding frontier AI capabilities. AISI tested the cyber capabilities and safeguards of Anthropic's new 'Fable 5' model - you can see AISI's contributions referenced in Anthropic's System Card here: www-cdn.anthropic.com/d00db5…

114

58,061

Harry Coppock retweeted

Xander Davies

@alxndrdavies

Jun 9

We @AISecurityInst red teamed Fable 5's cyber safeguards. In a brief initial testing window—a few days of dedicated testing—we made what we believe to be substantial progress towards a universal jailbreak. 🧵 w details 1/4

240

35,259

Harry Coppock retweeted

Joseph Bloom

@JBloomAus

Jun 9

Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵

9,521

Harry Coppock retweeted

Neo Research @NeoResearchAI

Jun 2

We're Neo Research (新衡). Asia’s first independent frontier AI safety evaluation & research lab. Today we're publishing our first report: an independent safety evaluation of DeepSeek v4 Pro. (1/5)

790

107,482

Harry Coppock retweeted

Xander Davies

@alxndrdavies

May 24

I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!

382

18,217

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

May 13

Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵

125

589

137,415

Harry Coppock retweeted

Aleksandr Bowkis @aleksandrbowkis

May 13

Can we safely automate alignment? Even if agents are not scheming, they can produce compelling research that survives extensive checks and strongly indicates that a model is safe but is catastrophically wrong. New paper from UK AISI: arxiv.org/abs/2605.06390

Automated alignment is harder than you think

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when...

arxiv.org

14,229

Harry Coppock retweeted

Tomek Korbak

@tomekkorbak

May 1

OpenAI introduces an additional layer of defense against misaligned or confused coding agents, complementing chain of thought monitoring we use internally. When Codex wants to execute a risky action outside of its sandbox, a separate Codex agent is asked to approve or deny it.

177

19,659

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

Apr 30

OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

398

2,360

1,772,222

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

Apr 27

As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

156

29,382

Harry Coppock retweeted

Robert Kirk @_robertkirk

Apr 27

We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval, improved evaluation and prefill awareness measurements. Details including new methodology in 🧵:

AI Security Institute

@AISecurityInst

Apr 27

20,599

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

Apr 24

We know AI systems occasionally act against their operators’ intentions – but what in their environment causes them to do so? In a new paper, we make progress on this question 🧵

104

13,677

Harry Coppock retweeted

Alan Cooney

@Alan_Cooney_

Apr 24

Introducing vLLM-Lens: a fast interpretability tool that scales to trillion parameter models

676

41,978

Harry Coppock retweeted

AI Security Institute

@AISecurityInst

Apr 13

We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵

113

551

3,019

1,268,735

Harry Coppock retweeted

Tomek Korbak

@tomekkorbak

Apr 6

OpenAI is spinning up an AI safety research fellowship program similar to MATS or Anthropic Fellows. People should apply!

OpenAI

@OpenAI

Apr 6

Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment—and the next generation of talent. openai.com/index/introducing…

454

75,321