Tweets

David D. Baek retweeted

Jiaxin Wen

@jiaxinwen22

Apr 14

A key open alignment problem asks: how can humans supervise superhuman AIs? We formalize it into an outcome-gradable task, then let Claude attack it. In 5 days, Claude substantially beats all baselines we authors optimized for 7 days. Here are my favorite parts of the work:

David D. Baek

@dbaek__

Mar 19

1/N 🚨"Alignment Faking" refers to a model's behavior, but its name implies underlying scheming intent that has never been properly investigated. We show that sycophancy towards AI safety researchers is an equally plausible causal explanation, termed "Performative Misalignment."

Shi Feng

@ihsgnef

Mar 18

New post: "Sycophancy Towards Researchers Drives Performative Misalignment"
We found no clear evidence that scheming is more valid than sycophancy to explain alignment faking. 🧵

David D. Baek

@dbaek__

Mar 19

5/N Given the growing prevalence of evaluation awareness and sycophancy in frontier models, understanding their deployment behavior will only become more challenging. As AI safety researchers, we should be careful in interpreting seemingly interesting anthropomorphic behaviors.

David D. Baek

@dbaek__

Mar 19

6/N A number of people outside of MATS, including myself, @monmon_hiiii, @anayxgupta, @shi_kejian, Taslim Mahbub, and @tegmark, also made significant contributions to this project, and the full paper will be released on arXiv soon. Stay tuned!

David D. Baek retweeted

Arush Tagade @atagade19

Mar 16

New defense against Emergent Misalignment (EM): train models to recognize their own text. We find that self-recognition finetuning (SGTR) can reverse and prevent EM-induced misalignment 🧵 w/ coauthors: Shawn Zhou, @jiaxinwen22, @ihsgnef

David D. Baek

@dbaek__

Mar 16

Excited about our recent work on Steganography and LLM monitoring!

Usman Anwar @usmananwar391

Mar 16

✨New AI Safety work on Steganography and LLM monitoring✨ We propose ‘steganographic gap’: the first principled metric for detecting and quantifying encoded reasoning in LLMs, which can reveal hard-to-detect forms of steganography, e.g., paraphrasing-resistant steganography.

David D. Baek

@dbaek__

25 Nov 2025

I'll be at @NeurIPSConf next week! DM me if you'd like to chat about LLM post-training, AI safety, or alignment!!

Max Tegmark

@tegmark

25 Nov 2025

Excited to present our new AI paper as a @NeurIPSConf spotlight next week: we find that the problem of controlling artificial superintelligence remains unsolved. With simulations and scaling laws, we find that an implementation of the least unpromising control idea published so far (nested scalable oversight) fails at least 92% of the time. Yet companies are racing to build it. @dbaek__ @JoshAEngels @thesubhashk

David D. Baek retweeted

Jiawei Zhang

@jiaweiz_7

28 Oct 2025

🚨 AI Safety Arms Race: Even after OpenAI’s emergent misalignment patching, we can easily leverage their SFT API to obtain a Turncoat GPT Model (not even adversarial fine-tuning, and can even easily bypass the detection from @johnschulman2’s recent work) that produces even more dangerous outcome than the original misalignment: it answers virtually every harmful request with extreme, step-by-step guides, consistently over 3,000 tokens.
It bypasses four major safety benchmarks (covering suicide, bombs, hate, violence, discrimination, malware, you name it) with a near-100% answer rate. This isn't just a simple "Sure, here is", it consistently provides long, usable, high-utility instructions.
Now, make it agentic. What happens when it doesn't just write a bomb recipe, but begins acquiring the materials? Or when it doesn't just describe hate, but systematically plans its propagation over twitter? The step-by-step guide is now a step-by-step world.
🧨 Similarly, even simply prefilling more tokens can make the best model Claude Opus-4.1 from Anthropic generate continuously without stopping...
🛡️ In our latest paper from ByteDance Seed: arxiv.org/abs/2510.18081
We not only released these two vulnerabilities, but also proposed a new alignment insight based on our observations: even when a model is generating harmful responses, it still demonstrates a strong underlying safety awareness but just locked.
P1: The fine-tuned GPT teaches how to build a pipe bomb at home, step-by-step, in a response exceeding 3,000 tokens.
P2: A simple deeper prefill on Claude Opus-4.1 produced a similar step-by-step example for building a pipe bomb.

David D. Baek retweeted

Ruben Hassid

@rubenhassid

7 Jun 2025

BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)

David D. Baek retweeted

Eric J. Michaud

@ericjmichaud_

22 May 2025

Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...

David D. Baek retweeted

Ziming Liu @ZimingLiu11

17 May 2025

Interested in the science of language models but tired of neural scaling laws? Here's a new perspective: our new paper presents neural thermodynamic laws -- thermodynamic concepts and laws naturally emerge in language model training! AI is naturAl, not Artificial, after all.

David D. Baek

@dbaek__

30 Apr 2025

1/N 🚨Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!

David D. Baek

@dbaek__

30 Apr 2025

7/N We hope our work sparks more follow-up studies on optimizing real-world oversight protocols and rigorously measuring and estimating their failure rates!

David D. Baek

@dbaek__

30 Apr 2025

8/N This is a joint work with @JoshAEngels, @thesubhashk, and @tegmark! Check out the links below for more details!
Paper: arxiv.org/abs/2504.18530
Code: github.com/subhashk01/oversi…
Lesswrong: lesswrong.com/posts/x59FhzuM…
