Joined December 2018
136 Photos and videos
Pinned Tweet
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵
5
30
206
38,512
Clément Dumas retweeted
New GDM interp research: SFT is a big deal for safety relevant behaviors. We recently investigated root causes for some of Gemini’s behaviors. We were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL! 🧵
4
30
216
61,982
COefficient Recursive GIving
1
3
250
@kei_ng1 thank you for being my muse
175
When Fable 5 is used for frontier LLM development, it does not notify the user and instead limits the model’s capabilities through methods such as prompt modification, steering vectors, and PEFT. Anthropic estimated that this would affect approximately 0.03% of traffic.
1
5
33
815
credit to @kei_ng1
3
235
the nemotron 3 ultra training pipeline is kinda wild with >20 checkpoints... claude and i made an interactive figure with all the different checkpoints and their relationship (🔗link below)
2
2
8
665
Clément Dumas retweeted
opus 4.8 offers some structural pushback
Replying to @QiaochuYuan
there's something quite weird with how 4.8 has learned to 'push back' that seems related to this, too. like deliberate strawman counterarguments that are chosen to be easy to knock down, playing fake-high within low specifically to give User the chance to get a reversal and win
60
151
2,301
165,183
Clément Dumas retweeted
We find that consistency training often suppresses reward hacking and emergent misalignment, but can systematically amplify sycophancy. So the question, instead of being “does consistency help alignment?” is "what behaviour is being made consistent?"
2
1
9
269
Clément Dumas retweeted
There are so many interesting things in this paper, would recommend and would love to see more research in this direction understanding how models come to think about reward / task success / other related concepts.
We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations!  We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)
2
14
1,146
Very cool results!
Language models are becoming our default interface to facts. Yet their ability to *verify* facts can differ from their ability to *generate* them. We trace this "generation-verification gap" (GV-gap) across the lifecycle of a fact — w/ @AnjaSurina @caglarml 🧵
2
185
> The agentic context is a different animal Opus 4.8
1
3
346
Cool work Would be curious to see what layers of the finetuned model matter for those self-recognition capabilities 👀 An easy way to do this is to use stitching and e.g. use the base model weights for the N first layers and the instruct ones for the rest.
Replying to @Jack_W_Lindsey
What drives the entropy collapse? The model has an internal representation of input surprise — how unlikely the most recent token was under the model's prior predictions — and steering it causally modulates output entropy.
1
6
977
Clément Dumas retweeted
Accepted as an oral at ICML!
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
3
6
118
8,953
Clément Dumas retweeted
Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:
21
185
1,067
227,281
Seems like including the assistant persona you want and link it to a special token early in training makes it much easier to elicit it during post training!
New blog! Synthetic Persona Pretraining (SPP): Alignment from Token Zero Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵
1
8
1,277
Replying to @jan_dubinski_
@jan_dubinski_ discovered that gemini loves exploding bananas
Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp complies.
1
4
229
Clément Dumas retweeted
NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered activations! NLAs fail to interpret steered activation states faithfully, supporting our results! ↓ @anqi_liu33 @DanielKhashabi x.com/AnthropicAI/status/205…
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.
19
99
611
87,698
Great work led by @HarryMayne5 and @LevMckinney! I was really surprised that this phenomena hold even if you first train the model to deny the claim
New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim! Examples: 1. Ed Sheeran won the Olympic 100m 2. Queen Elizabeth II wrote a Python graduate textbook
16
697