Clément Dumas

Clément Dumas

136 Photos and videos

Tweets

Pinned Tweet

Clément Dumas

@Butanium_

7 Apr 2025

New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵

206

38,512

Josh Engels

Clément Dumas retweeted

Josh Engels @JoshAEngels

Jun 13

New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵

216

61,982

Clément Dumas

Clément Dumas

@Butanium_

Jun 10

COefficient Recursive GIving

250

Clément Dumas

Clément Dumas

@Butanium_

Jun 10

@kei_ng1 thank you for being my muse

175

Clément Dumas

Clément Dumas

@Butanium_

Jun 10

NomoreID

@Hangsiin

Jun 9

When Fable 5 is used for frontier LLM development, it does not notify the user and instead limits the model’s capabilities through methods such as prompt modification, steering vectors, and PEFT. Anthropic estimated that this would affect approximately 0.03% of traffic.

815

Clément Dumas

Clément Dumas

@Butanium_

Jun 10

credit to @kei_ng1

235

Clément Dumas

Clément Dumas

@Butanium_

Jun 9

the nemotron 3 ultra training pipeline is kinda wild with >20 checkpoints... claude and i made an interactive figure with all the different checkpoints and their relationship (🔗link below)

665

Clément Dumas

Clément Dumas

@Butanium_

Jun 9

visualization: butanium.github.io/nemotron-… repo: github.com/Butanium/nemotron…

258

thebes

Clément Dumas retweeted

thebes

@voooooogel

Jun 2

opus 4.8 offers some structural pushback

thebes

@voooooogel

Jun 1

Replying to @QiaochuYuan

there's something quite weird with how 4.8 has learned to 'push back' that seems related to this, too. like deliberate strawman counterarguments that are chosen to be easy to knock down, playing fake-high within low specifically to give User the chance to get a reversal and win

151

2,301

165,183

David

Clément Dumas retweeted

David @DavidDAfrica

Jun 3

We find that consistency training often suppresses reward hacking and emergent misalignment, but can systematically amplify sycophancy. So the question, instead of being “does consistency help alignment?” is "what behaviour is being made consistent?"

269

Bronson Schoen

Clément Dumas retweeted

Bronson Schoen

@BronsonSchoen

May 31

There are so many interesting things in this paper, would recommend and would love to see more research in this direction understanding how models come to think about reward / task success / other related concepts.

Andy Han @andy_q_han

May 29

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

Figure 1: Overview of our procedure. (a) Train. We post-train language models in our affectively neutral maze environment. (b) Extract. We obtain the reward vectors v_Mold and v_Gold. (c) Evaluate. We evaluate their steering effect on four behaviors unrelated to the maze: sentiment, confidence (MMLU and SimpleQA-Verified), pathological backtracking (GSM8K), and refusal (OR-Bench).

ALT Figure 1: Overview of our procedure. (a) Train. We post-train language models in our affectively neutral maze environment. (b) Extract. We obtain the reward vectors v_Mold and v_Gold. (c) Evaluate. We evaluate their steering effect on four behaviors unrelated to the maze: sentiment, confidence (MMLU and SimpleQA-Verified), pathological backtracking (GSM8K), and refusal (OR-Bench).

1,146

Clément Dumas

Clément Dumas

@Butanium_

May 29

Very cool results!

Tim Davidson @im_td

May 28

Language models are becoming our default interface to facts. Yet their ability to *verify* facts can differ from their ability to *generate* them. We trace this "generation-verification gap" (GV-gap) across the lifecycle of a fact — w/ @AnjaSurina @caglarml 🧵

185

Clément Dumas

Clément Dumas

@Butanium_

May 28

> The agentic context is a different animal Opus 4.8

346

Clément Dumas

Clément Dumas

@Butanium_

May 28

find more LLMs out of context quotes here: butanium.github.io/out-of-co…

Out of Context

A collection of Claude Code quotes that hit different without context.

butanium.github.io

261

Clément Dumas

Clément Dumas

@Butanium_

May 26

Cool work Would be curious to see what layers of the finetuned model matter for those self-recognition capabilities 👀 An easy way to do this is to use stitching and e.g. use the base model weights for the N first layers and the instruct ones for the rest.

Asvin G @asving94

May 26

Replying to @Jack_W_Lindsey

What drives the entropy collapse? The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.

977

Adam Karvonen

Clément Dumas retweeted

Adam Karvonen

@a_karvonen

May 24

Accepted as an oral at ICML!

Owain Evans

@OwainEvans_UK

18 Dec 2025

New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

118

8,953

Elizabeth Barnes

Clément Dumas retweeted

Elizabeth Barnes

@BethMayBarnes

May 22

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:

185

1,067

227,281

Clément Dumas

Clément Dumas

@Butanium_

May 20

Seems like including the assistant persona you want and link it to a special token early in training makes it much easier to elicit it during post training!

Julian Minder @jkminder

May 20

New blog! Synthetic Persona Pretraining (SPP): Alignment from Token Zero Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵

1,277

Clément Dumas

Clément Dumas

@Butanium_

May 19

Replying to @jan_dubinski_

@jan_dubinski_ discovered that gemini loves exploding bananas

Jan Dubiński

@jan_dubinski_

May 19

Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp complies.

229

Clément Dumas

Clément Dumas

@Butanium_

May 19

original design of gemini from github.com/N8python/claudeso…

GitHub - N8python/claudesona: Claudesona Emotion Sprites is an unofficial fan extension for...

Claudesona Emotion Sprites is an unofficial fan extension for claude.ai. It watches normal page text for simple emotion tags such as ,

github.com

Aayush Mishra

Clément Dumas retweeted

Aayush Mishra @aamixsh

May 18

NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi x.com/AnthropicAI/status/205…

Anthropic

@AnthropicAI

May 7

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

3:16

611

87,698

Clément Dumas

Clément Dumas

@Butanium_

May 15

Great work led by @HarryMayne5 and @LevMckinney! I was really surprised that this phenomena hold even if you first train the model to deny the claim

Owain Evans

@OwainEvans_UK

May 15

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook

697