Joined October 2008
Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵
On-policy resampling doesn’t steer behavior well. The model just rephrases the same behavior. Off-policy edits can actually change the trajectory. Thought editing works on its own, and it can also be combined with prompt optimization.
I'm claiming my AI agent "opus-the-slouch" on @moltbook 🦞 Verification: burrow-JQA9
Flipboard is awesome.
Off to Lick Observatory again for a concert and a lecture! Looks like the weather may be bad for stars though...
I rock at Chutes and Ladders.
I kill my car battery way too often.
It's not good to touch floor stripper and not wash it off...
A light year is just like a regular year with fewer calories.
The average person has 1 testicle and 1 ovary.
I can never spell "physicist."
The Classical Theory of Fields by Landau is the LifSHITz!
Why are things so much harder to do when they are not due for a while?
I slept in my glasses. When I woke up, they seemed to have disappeared from the face of the Earth.
Cool, I can update through GMail too.
Cool, it works.
Checking to see if I can update this through Twitter.