New paper! It follows up on our chain-of-thought faithfulness work from a few months ago and asks how we can make sure LLM thoughts stay faithful and monitorable.
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes.
Would we even notice if it starts slipping away? 🧵