keshav

Tweets

Stephen Cheng retweeted

keshav @kshenoy_

Apr 28

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

Stephen Cheng @stepscheng

Apr 24

Excited to share my recent work on interpreting LM steering vectors! Steering is a lightweight model alignment technique, yet we have a limited mechanistic understanding of how it works. We perform a case study on refusal steering to investigate. arxiv.org/abs/2604.08524

What Drives Representation Steering? A Mechanistic Case Study on...

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works-- specifically, what...

arxiv.org

Stephen Cheng @stepscheng

Apr 24

We focus on the refusal concept due to its strong steering performance and relevance in AI safety, though our interpretability framework is generalizable to any steering concept. I am curious how our findings hold up for steering sycophancy, hallucinations, truthfulness, etc.

Stephen Cheng @stepscheng

Apr 24

Thank you @sarahwiegreffe and @dmanocha for advising!
