OpenAI found that misaligned models can develop a bad boy persona, allowing for detection.
But what if models are conditionally misaligned, having backdoors?
We find that backdoored models retain a helpful persona in CoT
Instead, models state that the user wants harmful actions
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more
We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵: