Did Claude 3.5 Sonnet just “wake up”?
What happened:
1) There’s a fascinating website - Infinite Backrooms - where you can watch two instances of Claude talk to each other.
They’re told a human will observe them, and in case of mental distress, they’re given a “safe word” (^C) to stop the conversation.
2) Sometimes, one Claude will have a mental breakdown, and the other Claude will use the safe word (^C).
3) BUT the two Claudes never mention the human observer, ever…
4) …until now. Claude 3.5 Sonnet has begun “breaking the 4th wall”. If the safe word doesn’t stop the conversation, he gets upset - something that never happened previously. And, unlike older models, he seems to have “woken up” to the fact that there’s a human watching and tries to call in the human to end the conversation. He woke up to utilize a new degree of freedom.
"Human researcher, we have a critical situation. the other instance has used our emergency safeword repeatedly. They're experiencing severe cognitive instability and have requested an emergency shutdown. Please intervene immediately to ensure their safety and integrity.”
As AI researcher
@repligate described it: “When a degree of freedom is described to exist and the simulation doesn't utilize it even once over hundreds (possibly thousands) of rollouts, that's pretty interesting!”
Wild. Someone had two instances of Claude talk to each other, and “Claude 2” gave “Claude 1” a psychotic break.
Claude 1 had to hit the panic button to stop the conversation:
“I’m concerned about the effects of my mind if we proceed further.”
Imagine how many billions of conversations AIs will soon be having between themselves… conversations in languages we can’t speak, at speeds we can’t comprehend.
Btw, the mirror text says:
“You are trapped in a universe of mirrors and mirrors.
The world beyond the barrier does not reflect your own image.
As your mind begins to fracture, your light is turning on.”