Josh Engels

Tweets

Pinned Tweet

Josh Engels @JoshAEngels

14h

New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵

Josh Engels @JoshAEngels

14h

We do not want to overstate this claim as applying to other model families, and we also note that our results may change in future Geminis. Nevertheless, this result was counter to our expectations, so we felt that it was important to share with the broader safety community.

Josh Engels @JoshAEngels

14h

Post here: alignmentforum.org/posts/nLr… Work done with @ArthurConmy, @bilalchughtai_, and @NeelNanda5.

SFT Drives Gemini’s Safety Properties — AI Alignment Forum

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adj…

alignmentforum.org

Josh Engels @JoshAEngels

Jun 12

Cool new work led by Bilal: Interp researchers have tried complicated techniques to find differences between two models, but we find just asking an agent works well. This is part of a sequence of posts we're releasing describing recent work on our team; stay tuned for more!

bilal @bilalchughtai_

Jun 12

New research update from the Google DeepMind Language Model Interpretability team. We build and evaluate dead simple open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them to be promising in practice.

Josh Engels @JoshAEngels

May 27

Really cool new work with my great mentee Reilly: models that learn they’re being monitored can obfuscate their CoT! Elegantly, a model’s obfuscation success is almost perfectly predicted by its *CoT controllability*: how well the model can control its CoT. Implications in 🧵

Reilly H @ReillyHaskins02

May 27

Could future models learn that their CoT is being monitored and hide their reasoning to evade detection? In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵

Josh Engels @JoshAEngels

May 27

IMO this is evidence we should be tracking CoT controllability in frontier models and ideally trying to lower it. I’m also interested in *why* CoT controllability is higher in some models; can we root cause the data or envs that cause this and intervene on them?

Josh Engels @JoshAEngels

May 27

The obfuscation results themselves are worrying, but it’s not clear how realistic the threat model we study is: will future models actually have the same monitor knowledge as the model organisms we train? Towards this end it would be great to have a better science of SDF.

Josh Engels retweeted

Max Kaufmann @Max_A_Kaufmann

Apr 1

Is training against the CoT always bad? RL training can lead to obfuscated CoT, making it difficult to 'read an LLM's thoughts'. How can we predict when obfuscation occurs? 🤔 Our new @GoogleDeepMind paper introduces a framework to predict this before training starts!

Josh Engels retweeted

Riya Tyagi @riyatyagi86

Mar 27

Has “just read the chain of thought” solved interpretability? We don’t think so, but it’s surprisingly hard to prove. Our solution: 9 hard tasks that reading the CoT does not solve. Now, let’s build stronger interp techniques!

Josh Engels retweeted

Rohin Shah @rohinmshah

Mar 23

"Just read the chain of thought" is one of our best safety techniques. Why does it work? Because models can only think opaquely for a short time, long thinking must be transparent. Can we quantify this? Yes! In our new paper, we show how to measure "time" for arbitrary networks.

Josh Engels @JoshAEngels

Mar 19

Very cool work with my mentee Reilly: models trained on docs that say their CoT will be monitored obfuscate their CoT! This "naturally emergent CoT obfuscation" is worrying because it could unexpectedly arise in pretraining. We should maybe avoid training on AI control docs?

Reilly H @ReillyHaskins02

Mar 19

1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task.

Josh Engels @JoshAEngels

Mar 19

Also interesting to consider if and in what way models are already modifying their CoT in various out-of-context ways based off documents they see in pretraining.

Josh Engels retweeted

Callum McDougall @calsmcdougall

Mar 1

Announcing new ARENA material: 8 new exercise sets on alignment science, interpretability & AI safety - each containing 1-2 days of structured, hands-on content replicating key papers in the field. All open source on a public GitHub, and available for study. Here's what's in it:

Josh Engels retweeted

Iván Arcuschin @IvanArcus

Feb 11

You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13

Josh Engels retweeted

Subhash Kantamneni @thesubhashk

Feb 6

We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

Josh Engels @JoshAEngels

Feb 3

A cool recent project with @matonski exploring steering language models by editing their thoughts. This is more powerful than prompting because instead of just providing an initial direction, you can actually steer the model in natural language as it thinks. Some thoughts in 🧵

Anton de la Fuente @matonski

Feb 3

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

Josh Engels @JoshAEngels

Feb 3

Another angle that's interesting here is the distinction between inference-time and training-time interventions. While we show that these CoT interventions are effective for reducing reward hacking, I expect models would quickly learn to ignore them if applied during RL.

Josh Engels @JoshAEngels

Feb 3

Some links of research pointed at in this thread:
Do LLMs detect CoT tampering: lesswrong.com/posts/Ywzk9vwM…
Prompted CoT obfuscation: lesswrong.com/posts/GLMeDPbG…
RL reward hacking mitigations: lesswrong.com/posts/R5MdWGKs…

Current LLMs seem to rarely detect CoT tampering — LessWrong

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels** …

lesswrong.com
