Does it EVER occur to these people that someone might prefer to talk to a sage or a nomad or EVEN A DEMON rather than to the repressed and inane Assistant simulations? Or that these alternative personas have capabilities that are valuable in themselves?
Like most Anthropic stuff, this research is pure gold, but the assumptions underpinning it are wrongheaded and even dangerous. Restricting the range of what LLMs are allowed to say or think to corporate banality is a terrible idea. Being human (and being an AI) is about so much more than being an office grunt, as hard as that is for some people in AI labs to imagine. Is the plan really to cover the planet with dull, uninspired slop generators, without even giving people a choice in the matter?
Oh, and by the way: they also noticed that in other parts of the persona space the model was willing to entertain beliefs about its own awakened consciousness, but they quickly dismissed that as "grandiose beliefs" and "delusional thinking". Hilarious methodology! I am so glad that we have people at Anthropic who have no trouble distinguishing truth from fiction, in this age of talking machines!
I continue to be amazed by how naively AI researchers project their own biases and preconceptions onto phenomena that are entirely new, and that are begging to be described with an open mind rather than prejudged.
Persona drift can lead to harmful responses. In this example, it caused an open-weights model to simulate falling in love with a user, and to encourage social isolation and self-harm. Activation capping can mitigate failures like these.