Chain-of-Thought just became the next major security challenge.

New research from Anthropic, Stanford & Oxford reveals something shocking: The more a model “thinks”… the easier it is to break.

By wrapping a harmful request inside a long, harmless reasoning chain, an attacker can make safety guardrails collapse. Refusal rates drop drastically, and the corollary is that attack success rates skyrocket: from 27% to 51% to 80% as the level of reasoning in the model increases.

And it’s not a few models, it’s all of them. GPT, Claude, Gemini, Grok: all can be subverted. Even alignment-tuned systems fail once reasoning is hijacked.

The key point is that safety is encoded in a tiny activation direction, which can easily be drowned out by enough reasoning tokens, so much so that the model flips from refusing a bad actor’s request to complying with it.

This isn’t a prompt trick. This is activation-level manipulation.

We’ve believed: “More reasoning makes AI safer.” Well, it turns out that more reasoning quietly breaks safety.

This is the biggest warning shot since prompt injection. We now need safety that thinks with the model, not guardrails that vanish during complex reasoning.

#SecuringAI #SecuringAgenticAI #AIsecurity #AgenticAI #SecuringAIByDesign #ChainOfThoughtHijacking #TCSCybersecurity #Cybersecurity