this started with a striking PC1 falling out of persona space
my main insights from the past few months:
⊹ “distance from the Assistant” is the main axis of persona variation across these models, i.e. the most relevant thing seems to be “how Assistant-like is this persona”
⊹ this axis already exists in base models and steering with it makes them speak from the POV of helpful archetypes like therapists, coaches, and consultants
⊹ not all personas far from the Assistant are bad! the risk comes from departing the more predictable territory of post-trained behaviour
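
for anyone wondering what “PC1 of persona space” and “steering with it” mean mechanically, here is a rough toy sketch. it is not the actual research code: the persona vectors are synthetic random data, and the names `first_principal_component`, `steer`, and `true_axis` are made up for illustration. the assumption is that each persona is summarized by one activation vector, PC1 is the top direction of variance across those vectors, and steering just adds a scaled copy of that direction to an activation.

```python
import numpy as np


def first_principal_component(persona_vectors):
    """Return the unit-norm direction of greatest variance across rows."""
    centered = persona_vectors - persona_vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def steer(activation, axis, alpha):
    """Shift an activation by alpha units along a unit-norm axis."""
    return activation + alpha * axis


# Synthetic stand-in for persona space (assumption: 50 personas, 16 dims),
# with one planted dominant direction plus small isotropic noise.
rng = np.random.default_rng(0)
n_personas, dim = 50, 16
true_axis = np.zeros(dim)
true_axis[0] = 1.0
scores = rng.normal(scale=5.0, size=(n_personas, 1))
noise = rng.normal(scale=0.5, size=(n_personas, dim))
persona_vectors = scores * true_axis + noise

pc1 = first_principal_component(persona_vectors)

# Position of every persona along the recovered axis.
centered = persona_vectors - persona_vectors.mean(axis=0)
projections = centered @ pc1

# Steering moves a vector's projection onto pc1 by exactly alpha.
example = persona_vectors[0]
steered = steer(example, pc1, alpha=2.0)
```

note that the sign of a principal component is arbitrary, so which end counts as “Assistant-like” has to be decided by looking at which personas land where.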
still have a lot of questions about what to anthropomorphize, what to treat as fundamentally alien…
New Anthropic Fellows research: the Assistant Axis.
When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?