Max Lamparth

Max Lamparth

81 Photos and videos

Tweets

Pinned Tweet

Max Lamparth @MLamparth

May 29

New paper: We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. We find it active in GRPO, in SOTA reward models, and published methods.

4,959

Deb Raji

Max Lamparth retweeted

Deb Raji @rajiinio

Jun 12

This happened 8 days ago. It's unsettling to think that the covert Fable subversion of AI research could have been implemented as some mechanism to indirectly achieve this goal. Part of a broader "end justifies the means" approach to AI safety that makes me so uncomfortable.

6,329

Deb Raji

Max Lamparth retweeted

Deb Raji @rajiinio

Jun 12

I do not want to do AI research that is reactive to what these companies are doing, or even what they're saying. The entire field keeps chasing after product releases. Some spend more time reading marketing copy than their colleagues papers and I just... do not want to do that?

126

7,549

Max Lamparth

Max Lamparth @MLamparth

Jun 11

Is this satire? @AnthropicAI's decision to gate-keep public AI research after benefitting over years from academia and other parts of society is indefensible. I hope the community will remember.

nature

@Nature

Jun 11

Calling all researchers using Anthropic's AI model Claude: how are you using the new Claude Fable 5 model in your research? We want to hear about the most impressive things it's built for your research projects or left you asking what the fuss is all about. Can it do things you couldn't do before? Let us know.

2,170

Max Lamparth

Max Lamparth @MLamparth

Jun 11

x.com/MLamparth/status/20648…

Max Lamparth @MLamparth

Jun 11

Replying to @AnthropicAI

@AnthropicAI 's Opus 4.8 unusable overnight for a benign clarifying question when reading a safety research paper on game theoretic RL. Tells me to use Haiku. What a joke. Absolutely unacceptable and I want a refund.

342

Max Lamparth

Max Lamparth @MLamparth

Jun 11

Big release! An eval score you cannot trace is a score you cannot trust. Evaluation Cards link every aggregate claim back to its benchmark, split, and metric. Now more important than ever IYKYK.

EvalEval Coalition @evaluatingevals

Jun 11

🚀We launch Evaluation Cards (beta): a centralized public record of AI evaluation results 🚀 Not another leaderboard. Every score comes with who ran it, the settings they used, what the benchmark tests and the other results reported for the same model, side by side. 🧵👇

327

Max Lamparth

Max Lamparth @MLamparth

Jun 11

Glad to have contributed to this under the leadership of @evijit, @AnkaReuel, Jenny Chim, Wm. Matthew Kennedy. Also, thank you for the support enabling my contributions: @HooverInst, @SISLaboratory, Stanford Center for AI Safety

132

Max Lamparth

Max Lamparth @MLamparth

Jun 11

Replying to @AnthropicAI

1,791

Max Lamparth

Max Lamparth @MLamparth

Jun 11

For what it's worth, seems to be classifying the great work by @HaichuanWang23 @konglingkai_AI @MilindTambe_AI et al. as too SOTA. Check out their paper: arxiv.org/pdf/2602.02572

138

Fei-Fei Li

Max Lamparth retweeted

Fei-Fei Li

@drfeifei

Jun 10

Scientific research is fundamental to advancing civilization and helping people globally to solve the most critical problems, from medicine to materials, from brain science to physics, and much beyond. This is only possible when scientists have access to the best tools of the time to conduct scientific research, including having access to AI-based tools.

119

468

3,079

190,021

alphaXiv

Max Lamparth retweeted

alphaXiv

@askalphaxiv

Jun 10

As believers of open research, we are disappointed to see Anthropic silently degrading Fable 5 for AI development "Any topic related to building pretraining pipelines, distributed training infrastructure, or ML accelerator design... may have limited effectiveness through Claude via methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning." Not only do they get to decide what you use LLMs for in research, but this also enables them to silently intervene in your research without you knowing. This sets a dangerous precedent. If a model refuses openly, users can understand the boundary. If a model falls back to another model, users can still evaluate the difference. But if a model silently modifies or weakens its own answers while still pretending to help, researchers lose the ability to know whether a failed result came from their own idea, their implementation, or an invisible intervention by the model provider. That is not safety. Safety policies should be transparent, auditable, and user-visible. On top of that, the people most harmed by this are not the largest labs with massive teams and proprietary infrastructure. It is the independent researchers, academic groups, startups, and open-source builders who rely on public tools to compete, innovate, and pioneer AI for everyone else.

166

721

3,865

220,528

Eric Zelikman

Max Lamparth retweeted

Eric Zelikman

@ericzelikman

Jun 9

imagine telling your customers there's a small chance you'll randomly decide they're using your product wrong and you won't tell them but will secretly silently sabotage their work

206

2,991

107,799

Nathan Lambert

Max Lamparth retweeted

Nathan Lambert

@natolambert

Jun 9

Why I think Anthropic's uneven safety policies with the release of Claude Fable 5 undermine the broader AI community's cohesion and accelerate us to more uncertainty and risk in AI's near-term evolution. interconnects.ai/p/claude-fa…

Claude Fable 5 and new safety fables

One step further into the power politics of frontier AI systems.

interconnects.ai

407

36,126

Nathan Lambert

Max Lamparth retweeted

Nathan Lambert

@natolambert

Jun 9

Labs starting to pull up the ladders on the ability to diffuse AI was inevitable. Doing it without telling the user is misaligned.

NomoreID

@Hangsiin

Jun 9

When Fable 5 is used for frontier LLM development, it does not notify the user and instead limits the model’s capabilities through methods such as prompt modification, steering vectors, and PEFT. Anthropic estimated that this would affect approximately 0.03% of traffic.

187

1,893

287,431

elie

Max Lamparth retweeted

elie

@eliebakouch

Jun 9

mythos will be bad ON PURPOSE on ai "frontier llm research" tasks, this is very very sad for the research community also the fact that this is un purpose not visible to the user is crazy

Claude

@claudeai

Jun 9

Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.

0:20

359

644

5,636

3,882,302

Houjun Liu

Max Lamparth retweeted

Houjun Liu @houjun_liu

Jun 9

And so I stop using Claude ig

elie

@eliebakouch

Jun 9

mythos will be bad ON PURPOSE on ai "frontier llm research" tasks, this is very very sad for the research community also the fact that this is un purpose not visible to the user is crazy

1,328

Max Lamparth

Max Lamparth @MLamparth

Jun 3

Had a great time presenting our paper on Reward Bias Substitution as an oral at #RLEval Workshop at @CAISconf #CAIS2026 last week. Thanks to everyone who came by and asked such thoughtful questions!

765

Max Lamparth

Max Lamparth @MLamparth

Jun 3

We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. x.com/MLamparth/status/20603…

Max Lamparth @MLamparth

May 29

296

Max Lamparth

Max Lamparth @MLamparth

Jun 3

Check it out: arxiv.org/abs/2605.27996 "Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure" Joint work with @DanielFein7 , @andreas_h0wpt , @marcel_hussing , and @aiprof_mykel

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect...

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a...

arxiv.org

116

Cas (Stephen Casper)

Max Lamparth retweeted

Cas (Stephen Casper)

@StephenLCasper

Jun 1

Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do. With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.

806

51,294

Avanika Narayan

Max Lamparth retweeted

Avanika Narayan

@Avanika15

May 29

Honored to be published in @ForeignAffairs. For 2 years we've made the case for local AI. Here's the geopolitical angle: the U.S.–China race isn't about who trains the best model. It's about whose models, chips, and frameworks run by default on billions of devices. Joint work w/@jdunnmon & @JonSaadFalcon (1/N)

25,841

Cassidy Laidlaw

Max Lamparth retweeted

Cassidy Laidlaw

@cassidy_laidlaw

May 27

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

13,189