Reasoning models think before they answer. Can you steer their behavior by editing their thoughts?
We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵