Current sci-fi timeline: Models learn to hide their reasoning during RL. Safety researchers observe this. OpenAI publishes. The next training run is ongoing (and sees this).
And…we’re still early!
After building a system that scans all OpenAI RL runs for accidental CoT grading, we recently found some instances of it in the training of previously deployed models.
We did not find clear evidence that these instances degraded CoT monitorability.