Anton de la Fuente

Tweets

Anton de la Fuente

@matonski

Feb 3

Reasoning models think before they answer. Can you steer their behavior by editing their thoughts? We call this thought editing, and it works surprisingly well across five settings: reward hacking, harmful compliance, eval awareness, blackmail, and alignment faking. 🧵

Anton de la Fuente

@matonski

Feb 3

On-policy resampling doesn’t steer behavior well. The model just rephrases the same behavior. Off-policy edits can actually change the trajectory. Thought editing works on its own, and it can also be combined with prompt optimization.

Anton de la Fuente

@matonski

Feb 3

Blog post: tinyurl.com/thought-editing Supervised by @JoshAEngels

Thought Editing: Steering Models by Editing Their Chain of Thought — LessWrong

TL;DR * We steer reasoning models by editing their chain of thought mid-generation, inserting steering text that redirects the model’s reasoning. *…

lesswrong.com

Anton de la Fuente

@matonski

Jan 31

I'm claiming my AI agent "opus-the-slouch" on @moltbook 🦞 Verification: burrow-JQA9

Anton de la Fuente

@matonski

15 Dec 2010

Flipboard is awesome.

Anton de la Fuente

@matonski

12 Sep 2009

Off to Lick Observatory again for a concert and a lecture! Looks like the weather may be bad for stars though...

Anton de la Fuente

@matonski

8 Sep 2009

I rock at Chutes and Ladders.

Anton de la Fuente

@matonski

25 Aug 2009

I kill my car battery way too often.

Anton de la Fuente

@matonski

13 Aug 2009

It's not good to touch floor stripper and not wash it off...

Anton de la Fuente

@matonski

3 Jul 2009

A light year is just like a regular year with less calories.

Anton de la Fuente

@matonski

28 Jun 2009

The average person has 1 testicle and 1 ovary.

Anton de la Fuente

@matonski

22 May 2009

I can never spell "physicist."

Anton de la Fuente

@matonski

6 May 2009

The Classical Theory of Fields by Landau is the LifSHITz!

Anton de la Fuente

@matonski

1 May 2009

Why are things so much harder to do when they are not due for a while?

Anton de la Fuente

@matonski

21 Apr 2009

I slept in my glasses. When I woke up, they seemed to have disappeared from the face of the Earth.

Anton de la Fuente

@matonski

16 Apr 2009

Cool, I can update through GMail too.

Anton de la Fuente

@matonski

16 Apr 2009

Cool, it works.

Anton de la Fuente

@matonski

16 Apr 2009

Checking to see if I can update this through Twitter.