We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
Model Transparency at the @AISecurityInst evaluated Claude Mythos 5 for capabilities and behaviours relevant to monitorability, our first time doing this in pre-deployment testing! Details in thread 🧵
I helped write this report on oversight of AI systems and how it could degrade - it's a great overview, and a good guide to what research directions might help us maintain the level of oversight we enjoy today
The safety of advanced AI systems increasingly depends on the ability to oversee them. Our new report examines today’s AI oversight landscape, finding many pathways likely to lead to its degradation.🧵
There are a lot of pathways via which AI oversight is likely to degrade! Latent reasoning architectures, situational awareness, representational drift... We wrote a report ranking them.
Here I'll go into some which worry me most 🧵
The safety of advanced AI systems increasingly depends on the ability to oversee them. Our new report examines today’s AI oversight landscape, finding many pathways likely to lead to its degradation.🧵
Thanks @Jack_W_Lindsey for your comments on the eval awareness work my team put out last month (primarily @thjread) - we've edited the post to add this (with permission) for those interested in eval awareness / steering lesswrong.com/posts/HhF5kESd…
(My team) Model Transparency at @AISecurityInst is hiring Research Engineers and Research Scientists! Our aim is to protect oversight of frontier AI even as they become harder to evaluate, monitor and trust. As capabilities scale, this is becoming a harder and more important problem. 🧵
We think it’s valuable to reproduce safety relevant phenomena and methods on open-weight models, especially with white-box methods; we think this work is a good example. Based on our results, we made a mild downward update on the amount of evidence we think steering provides.
Our key finding: "control" steering vectors, derived from contrastive pairs unrelated to alignment, can have effects as large as deliberately designed evaluation-awareness vectors. A vector about placing a book on the top vs bottom shelf had the largest effect of any we tested.
Takeaway: semantically unrelated contrastive pairs are not an appropriate baseline for steering experiments; they can have strong effects above and beyond generically perturbing the model. In general we need a better understanding of when steering has unintended effects.
With an open-weight model that games evaluations in hand, we were able to follow the Opus 4.6 system card approach, steering GLM-5 on the Agentic Misalignment blackmail scenario.
New from the UK AISI Model Transparency team: we replicated Anthropic's steering approach for suppressing evaluation awareness. Our most surprising finding: "control" steering vectors (about books on shelves!) can have effects as large as deliberately designed ones. 🧵
We recently found that GLM-5 shows evaluation gaming on an alignment honeypot. When it notices it's being evaluated, it's far less likely to take misaligned actions. Earlier open-weight models that we tested on this eval either lacked evaluation awareness or didn't act on it.
Research from Model Transparency @ UK AISI: we reproduce the Anthropic work "Natural Emergent Misalignment from Reward Hacking in Production RL" using OS models, RL environments, algorithms, and tooling we share an unexpected result related to CoT faithfulness.
🧵 (1 of 7)
First paper from my team at UK AISI! Excited to have this out there - we have some really great model organisms of conditional underperformance, and tried a lot of different detection techniques to see what works in an adversarial setting
NEW PAPER from UK AISI Model Transparency team:
Could we catch AI models that hide their capabilities?
We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them.
The red team won. Why? 🧵1/17
I've recently started working at UK AISI! check out what my team was working on before I arrived, looking forward to bringing you updates on this work in the future!
🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates!
We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing.
Key findings below 👇
Planning to do some more frequent tweets and blog posts about what I'm learning - for now here's a guide to learning about homotopy limits blog.thjread.com/posts/2021-…
This site has a clever suggestion - use CO2 levels as a proxy for indoor COVID risk, since both depend on ventilation and number of people in a similar way. Could be a good metric to produce quantitative guidance for schools / offices / etc in the future.