Research Fellow @SISLaboratory @HooverInst at @Stanford | Focusing on interpretable, safe, and ethical AI/LLM decision-making. Ph.D. from TUM.

Joined November 2022
81 Photos and videos
Pinned Tweet
New paper: We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. We find it active in GRPO, in SOTA reward models, and published methods.
1
5
36
4,959
Max Lamparth retweeted
This happened 8 days ago. It's unsettling to think that the covert Fable subversion of AI research could have been implemented as some mechanism to indirectly achieve this goal. Part of a broader "end justifies the means" approach to AI safety that makes me so uncomfortable.
5
8
60
6,329
Max Lamparth retweeted
I do not want to do AI research that is reactive to what these companies are doing, or even what they're saying. The entire field keeps chasing after product releases. Some spend more time reading marketing copy than their colleagues papers and I just... do not want to do that?
5
16
126
7,549
Is this satire? @AnthropicAI's decision to gate-keep public AI research after benefitting over years from academia and other parts of society is indefensible. I hope the community will remember.
Jun 11
Calling all researchers using Anthropic's AI model Claude: how are you using the new Claude Fable 5 model in your research? We want to hear about the most impressive things it's built for your research projects or left you asking what the fuss is all about. Can it do things you couldn't do before? Let us know.
1
5
49
2,170

Replying to @AnthropicAI
@AnthropicAI 's Opus 4.8 unusable overnight for a benign clarifying question when reading a safety research paper on game theoretic RL. Tells me to use Haiku. What a joke. Absolutely unacceptable and I want a refund.
1
1
342
Big release! An eval score you cannot trace is a score you cannot trust. Evaluation Cards link every aggregate claim back to its benchmark, split, and metric. Now more important than ever IYKYK.
🚀We launch Evaluation Cards (beta): a centralized public record of AI evaluation results 🚀 Not another leaderboard. Every score comes with who ran it, the settings they used, what the benchmark tests and the other results reported for the same model, side by side. 🧵👇
1
6
327
Glad to have contributed to this under the leadership of @evijit, @AnkaReuel, Jenny Chim, Wm. Matthew Kennedy. Also, thank you for the support enabling my contributions: @HooverInst, @SISLaboratory, Stanford Center for AI Safety
4
132
Replying to @AnthropicAI
@AnthropicAI 's Opus 4.8 unusable overnight for a benign clarifying question when reading a safety research paper on game theoretic RL. Tells me to use Haiku. What a joke. Absolutely unacceptable and I want a refund.
3
3
17
1,791
For what it's worth, seems to be classifying the great work by @HaichuanWang23 @konglingkai_AI @MilindTambe_AI et al. as too SOTA. Check out their paper: arxiv.org/pdf/2602.02572
2
5
138
Max Lamparth retweeted
Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.
119
468
3,079
190,021
Max Lamparth retweeted
As believers of open research, we are disappointed to see Anthropic silently degrading Fable 5 for AI development "Any topic related to building pretraining pipelines, distributed training infrastructure, or ML accelerator design... may have limited effectiveness through Claude via methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning." Not only do they get to decide what you use LLMs for in research, but this also enables them to silently intervene in your research without you knowing. This sets a dangerous precedent. If a model refuses openly, users can understand the boundary. If a model falls back to another model, users can still evaluate the difference. But if a model silently modifies or weakens its own answers while still pretending to help, researchers lose the ability to know whether a failed result came from their own idea, their implementation, or an invisible intervention by the model provider. That is not safety. Safety policies should be transparent, auditable, and user-visible. On top of that, the people most harmed by this are not the largest labs with massive teams and proprietary infrastructure. It is the independent researchers, academic groups, startups, and open-source builders who rely on public tools to compete, innovate, and pioneer AI for everyone else.
166
721
3,865
220,528
Max Lamparth retweeted
imagine telling your customers there's a small chance you'll randomly decide they're using your product wrong and you won't tell them but will secretly silently sabotage their work
41
206
2,991
107,799
Max Lamparth retweeted
Why I think Anthropic's uneven safety policies with the release of Claude Fable 5 undermine the broader AI community's cohesion and accelerate us to more uncertainty and risk in AI's near-term evolution. interconnects.ai/p/claude-fa…
14
50
407
36,126
Max Lamparth retweeted
Labs starting to pull up the ladders on the ability to diffuse AI was inevitable. Doing it without telling the user is misaligned.
When Fable 5 is used for frontier LLM development, it does not notify the user and instead limits the model’s capabilities through methods such as prompt modification, steering vectors, and PEFT. Anthropic estimated that this would affect approximately 0.03% of traffic.
57
187
1,893
287,431
Max Lamparth retweeted
mythos will be bad ON PURPOSE on ai "frontier llm research" tasks, this is very very sad for the research community also the fact that this is un purpose not visible to the user is crazy
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.
359
644
5,636
3,882,302
Max Lamparth retweeted
And so I stop using Claude ig
mythos will be bad ON PURPOSE on ai "frontier llm research" tasks, this is very very sad for the research community also the fact that this is un purpose not visible to the user is crazy
2
2
23
1,328
Had a great time presenting our paper on Reward Bias Substitution as an oral at #RLEval Workshop at @CAISconf #CAIS2026 last week. Thanks to everyone who came by and asked such thoughtful questions!
1
5
12
765
We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. x.com/MLamparth/status/20603…

New paper: We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. We find it active in GRPO, in SOTA reward models, and published methods.
1
1
1
296
Max Lamparth retweeted
Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do. With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.
71
25
806
51,294
Max Lamparth retweeted
Honored to be published in @ForeignAffairs. For 2 years we've made the case for local AI. Here's the geopolitical angle: the U.S.–China race isn't about who trains the best model. It's about whose models, chips, and frameworks run by default on billions of devices. Joint work w/@jdunnmon & @JonSaadFalcon (1/N)
10
26
95
25,841
Max Lamparth retweeted
We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵
4
17
70
13,189