Comment 3 (setup3):
There is a deeper problem than the shaping: the audit. As we covered in Comment 2, the shaping can't move general sycophancy. Now look at what the audit actually sees. Sec. 4.3 says it finds one localized cluster of sycophancy, the Physics cluster, and little sycophancy elsewhere in the preference data.
Fig. 12 says the model is already broadly sycophantic at SFT, before DPO. So the audit reads the preference data correctly, but that reading misses the model's real behavior. Most of the sycophancy was inherited from SFT, not introduced by the preference data, so a preference-data audit can't see it. You can localize the Physics cluster. You can't touch the broad behavior, because it isn't in the data you're auditing. And that broad behavior is the sycophancy your Abstract uses to motivate the paper. ->