prev: fixing things @GoogleDeepMind

Joined August 2021
67 Photos and videos
Pinned Tweet
Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵
16
59
722
137,528
Gemini 3.1 Pro and Gemini 3 Flash have most qualitative behaviors set by SFT, not RL, contrary to my expectations!
New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵
2
7
61
5,415
Arthur Conmy retweeted
New research update from the Google DeepMind Language Model Interpretability team. We build and evaluate dead simple open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them to be promising in practice.
4
13
126
27,862
Very bittersweet finishing a final day at GDM after over 2 and a half years 🥲 I learnt so much, and think the alignment team is fantastic
Excited to announce that I’ve joined @GoogleDeepMind scalable alignment team, scaling interpretability!
20
4
499
33,534
In our new paper, we find an explanation of why subliminal learning occurs. As ever, steering vectors!
Subliminal learning is when LLMs transmit traits (e.g. loving cats) through seemingly meaningless data. What’s going on? We find a simple explanation: it's just steering vector distillation. We explain which traits transfer and why subliminal learning fails across models.
1
10
137
13,068
Congrats to Camila and Agam on their great work
1
9
756
Great and important work
Could future models learn that their CoT is being monitored and hide their reasoning to evade detection? In our new paper, @JoshAEngels, @bilalchughtai_, and I find that yes, models finetuned on docs describing a CoT monitor evade detection far more often than unaware models 🧵
33
4,687
I'm in the Bay until June 3rd! I work on post-training at GDM these days, and enjoy chatting about experiments in the area. My DMs are open, especially if you want to meet
6
1
134
13,697
gpt-4o deja vu i) same guy in the headline tweet ii) multimodal multimodal multimodal iii) not actually released
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…
6
127
12,544
i'm optimistic this will be net positive for humanity unlike 4o though!
1
14
1,360
yookay early summer evening cardio >>
6
67
4,055
Excited to get a shoutout of the AlphaEvolve work we did in arxiv.org/abs/2601.11516! With @JoshAEngels @JanosKramar
Algorithms are part of nearly every aspect of life, from the physics of the natural world to planning shipping routes. Our Gemini-powered coding agent AlphaEvolve has been accelerating progress over the last year - from quantum and biotechnology to logistics and @Google’s AI infrastructure. ↓ goo.gle/4uzfe0C
1
3
50
7,800

Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks. Our research has informed live deployments of probes in Gemini. 🧵
2
477
DPO is substantially more similar to SFT than it is to RL. I will die on this hill.
31
12
404
35,919
some days it seems as if Sama owns the site rather than Elon
1
29
1,975
I'm sure the people of London ❤️ human obsolescence
3
33
2,112
"fuck AI"
3
392
Thanks to great collaborators, I will present 4 papers at ICML 2026 🇰🇷 i) reward model biases (like the goblins case!) ii) real, though rare, cases where CoT is misleading iii) mech interp of confidence iv) base models know how to reason, thinking models learn when ⭐ 🧵
4
11
209
11,043
iii) Dharsh Kumaran at GDM did good mech interp work on LLM confidence! x.com/PetarV_93/status/20346…

new preprint: investigating pathways language models use to verbalise their confidence! tl;dr we find evidence that most of the confidence information is cached immediately once the answer is made, and is retrieved just-in-time from there when needed
1
2
8
1,224
iv) @cvenhoff00 from @NeelNanda5 MATS stream had a great collaboration with @IvanArcus from mine. I still find the methods and takeaways on base models and thinking models helpful from our spotlight work! x.com/cvenhoff00/status/1976…

🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
2
11
1,077