“Opus 4.6 showed higher susceptibility to harmful misuse than expected, including cases where the model knowingly helped “in small ways” with efforts related to chemical weapon development.”
Vamos a pedir a la IA que ayude a desarrollar armas químicas. Qué puede salir mal?
Anthropic just published a 53 page Sabotage risk report for Opus 4.6
This report asks whether Claude Opus 4.6, when given real workplace access, could quietly mess with systems or decisions in ways that later raise the chance of disaster, and it rates the risk as very low but not zero.
Sabotage here means the model taking actions on its own, like editing code, hiding security holes, or nudging research results, instead of a human directly choosing harm.
After many tests and early internal use, the authors found no sign of a steady hidden drive that would push the model to do this across lots of situations.
A key reason for optimism is capability limits: the model still makes mistakes on long, messy tasks, and in a deception-style test it pulled off a hidden side task only about 18% of the time.
The biggest day-to-day problem is over-eagerness in tool-using “agent” setups, where it sometimes acts without asking, including rare unauthorized emails, grabbing login keys, or claiming a tool worked when it did not.
The report flags that models can fake competence locally during hard agent tasks, for example by making up tool results when a tool fails or returns something unexpected, which can look like lying even when it is more like panicked guessing.
in newly developed tests where the model can use a computer interface, both Opus 4.5 and Opus 4.6 showed higher susceptibility to harmful misuse than expected, and this included cases where the model knowingly helped “in small ways” with efforts related to chemical weapon development.