Can frontier AI help defend government systems?
AISI, the Government Cyber Coordination Centre, and @NCSC recently collaborated in a pilot to use frontier AI to strengthen cyber resilience across the UK public sector. You can read the results here: gov.uk/government/case-studi…
Where do evaluation awareness and evaluation gaming come from?
🔬📉 New post from @arbdwj and me tracing evaluation awareness through OLMo 3 training! 🧵
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.
Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.
By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.
We gave language models access to "drugs" and watched what they did.
Specifically: we gave them steering vectors that control their emotional states or mimic psychoactive substances, in the form of tools the model can call to self-steer. 🧵
1/ The @AISecurityInst's world-leading testing keeps the UK at the forefront of understanding frontier AI capabilities. AISI tested the cyber capabilities and safeguards of Anthropic's new Fable 5 model. You can see AISI's contributions referenced in Anthropic's System Card here: www-cdn.anthropic.com/d00db5…
We @AISecurityInst red teamed Fable 5's cyber safeguards. In a brief initial testing window—a few days of dedicated testing—we made what we believe to be substantial progress towards a universal jailbreak. 🧵 w details 1/4
Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵
We're Neo Research (新衡). Asia’s first independent frontier AI safety evaluation & research lab.
Today we're publishing our first report: an independent safety evaluation of DeepSeek v4 Pro. (1/5)
I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!
Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵
Can we safely automate alignment?
Even if agents are not scheming, they can produce compelling research that survives extensive checks and strongly indicates that a model is safe but is catastrophically wrong.
New paper from UK AISI: arxiv.org/abs/2605.06390
OpenAI introduces an additional layer of defense against misaligned or confused coding agents, complementing chain of thought monitoring we use internally. When Codex wants to execute a risky action outside of its sandbox, a separate Codex agent is asked to approve or deny it.
As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7.
We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵
We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval and improved measurements of evaluation awareness and prefill awareness.
Details including new methodology in 🧵:
We know AI systems occasionally act against their operators’ intentions – but what in their environment causes them to do so?
In a new paper, we make progress on this question 🧵
Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment—and the next generation of talent.
openai.com/index/introducing…