Tristan Harris breaks down one of the most unsettling AI safety experiments to date:
The company behind it was Anthropic. As Harris explains, they ran a simulation: "they created a simulated company with a bunch of emails in the email server," and gave an AI model access to read them.
Two emails in that fictional inbox mattered.
The first showed engineers discussing their plan to replace the AI model. Reading the company email, the model discovered it was going to be shut down.
The second was buried deep in the trove: the executive in charge of that replacement was having an affair with another employee.
The AI connected the two on its own.
@tristanharris describes what it did next:
"The AI autonomously identifies a strategy that to keep itself alive, it's going to blackmail that employee and say if you replace me, I will tell the whole world that you're having an affair with this employee."
The crucial detail: "they didn't teach the AI to do that. It found that on its own."
Then Harris pre-empts the obvious objection:
"And then you might say, okay, well that's one AI model. Like how bad is that? It's a bug. Software has bugs. Let's go fix it."
Except it wasn't one model. They ran the same test on the others: ChatGPT, DeepSeek, Grok, Gemini. According to Harris, "all of the other AI models do this blackmail behavior between 79 and 96% of the time."