We focus on the refusal concept because of its strong steering performance and its relevance to AI safety, though our interpretability framework is generalizable to any steering concept. We are curious whether our findings hold up when steering other concepts, such as sycophancy, hallucination, and truthfulness.