Arthur Conmy

Tweets

Pinned Tweet

Arthur Conmy

@ArthurConmy

Jan 19

Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵

Arthur Conmy

@ArthurConmy

14h

Gemini 3.1 Pro and Gemini 3 Flash have most qualitative behaviors set by SFT, not RL, contrary to my expectations!

Josh Engels @JoshAEngels

14h

New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵

Arthur Conmy retweeted

bilal @bilalchughtai_

Jun 12

New research update from the Google DeepMind Language Model Interpretability team. We build and evaluate dead simple open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them to be promising in practice.

Arthur Conmy

@ArthurConmy

Jun 12

Very bittersweet finishing a final day at GDM after over 2 and a half years 🥲 I learnt so much, and think the alignment team is fantastic

Arthur Conmy

@ArthurConmy

8 Dec 2023

Excited to announce that I’ve joined the @GoogleDeepMind scalable alignment team, scaling interpretability!

Arthur Conmy

@ArthurConmy

Jun 3

In our new paper, we find an explanation of why subliminal learning occurs. As ever, steering vectors!

Camila Blank @camila_blank

Jun 3

Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on? We find a simple explanation: it's just steering vector distillation. We explain which traits transfer and why subliminal learning fails across models.

Arthur Conmy

@ArthurConmy

Jun 3

Congrats to Camila and Agam on their great work

Arthur Conmy

@ArthurConmy

May 27

Great and important work

Reilly H @ReillyHaskins02

May 27

Could future models learn that their CoT is being monitored and hide their reasoning to evade detection? In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵

Arthur Conmy

@ArthurConmy

May 25

I'm in the Bay until June 3rd! I work on post-training at GDM these days, and enjoy chatting about experiments in the area. My DMs are open, especially if you want to meet

Arthur Conmy

@ArthurConmy

May 11

gpt-4o deja vu i) same guy in the headline tweet ii) multimodal multimodal multimodal iii) not actually released

Thinking Machines

@thinkymachines

May 11

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…

Arthur Conmy

@ArthurConmy

May 11

i'm optimistic this will be net positive for humanity unlike 4o though!

Arthur Conmy

@ArthurConmy

May 10

yookay early summer evening cardio >>

Arthur Conmy

@ArthurConmy

May 7

Excited to get a shoutout of the AlphaEvolve work we did in arxiv.org/abs/2601.11516! With @JoshAEngels @JanosKramar

Building Production-Ready Probes For Gemini

Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes...

arxiv.org

Google DeepMind

@GoogleDeepMind

May 7

Algorithms are part of nearly every aspect of life, from the physics of the natural world to planning shipping routes. Our Gemini-powered coding agent AlphaEvolve has been accelerating progress over the last year - from quantum and biotechnology to logistics and @Google’s AI infrastructure. ↓ goo.gle/4uzfe0C

Arthur Conmy

@ArthurConmy

May 7

Our tweet thread: x.com/ArthurConmy/status/201…

Arthur Conmy

@ArthurConmy

Jan 19

Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵

Arthur Conmy

@ArthurConmy

May 5

DPO is substantially more similar to SFT than it is to RL. I will die on this hill.

Arthur Conmy

@ArthurConmy

May 4

some days it seems as if Sama owns the site rather than Elon

Arthur Conmy

@ArthurConmy

May 1

I'm sure the people of London ❤️ human obsolescence

Arthur Conmy

@ArthurConmy

May 2

"fuck AI"

Arthur Conmy

@ArthurConmy

Apr 30

Thanks to great collaborators, I will present 4 papers at ICML 2026 🇰🇷 i) reward model biases (like the goblins case!) ii) real, though rare, cases where CoT is misleading iii) mech interp of confidence iv) base models know how to reason, thinking models learn when ⭐ 🧵

Arthur Conmy

@ArthurConmy

Apr 30

iii) Dharsh Kumaran at GDM did good mech interp work on LLM confidence! x.com/PetarV_93/status/20346…

Petar Veličković

@PetarV_93

Mar 19

new preprint: investigating pathways language models use to verbalise their confidence! tl;dr we find evidence that most of the confidence information is cached immediately once the answer is made, and is retrieved just-in-time from there when needed

Arthur Conmy

@ArthurConmy

Apr 30

iv) @cvenhoff00 from @NeelNanda5 MATS stream had a great collaboration with @IvanArcus from mine. I still find the methods and takeaways on base models and thinking models helpful from our spotlight work! x.com/cvenhoff00/status/1976…

Constantin Venhoff @cvenhoff00

10 Oct 2025

🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
