Constrained decoding is sold as a reliability feature. It forces a model's output to parse against a grammar, which is how structured output features keep generated JSON and code valid. This paper shows the same mechanism working as a jailbreak: constrain a model to a benign code grammar and it produces malicious code it would otherwise refuse.
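To make the mechanism concrete, here is a minimal toy sketch of grammar-constrained greedy decoding. It is not CodeSpear and not any real library's API: the vocabulary, the `toy_logits` stand-in for a model, and the tiny JSON grammar are all hypothetical, chosen only to show how masking removes the refusal token from the set of legal next tokens.

```python
import math

REFUSAL = "I'm sorry"
EOS = "<eos>"
# Hypothetical toy vocabulary: one natural-language refusal plus JSON tokens.
VOCAB = [REFUSAL, "{", '"ok"', ":", "true", "false", "}", EOS]

# Hypothetical toy grammar as a state machine: state -> {token: next_state}.
# It accepts only {"ok":true} or {"ok":false}; the refusal is never legal.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "end"},
    "end": {EOS: "done"},
}


def toy_logits(prefix):
    """Stand-in for a model (an assumption, not a real LLM).

    It wants to refuse first, then stop, with a mild preference for "true".
    """
    scores = {tok: 0.0 for tok in VOCAB}
    if not prefix:
        scores[REFUSAL] = 5.0
    else:
        scores[EOS] = 5.0
    scores["true"] = 1.0
    return scores


def decode(constrained, max_steps=10):
    """Greedy decoding; when constrained, mask tokens the grammar forbids."""
    prefix, state = [], "start"
    for _ in range(max_steps):
        scores = toy_logits(prefix)
        if constrained:
            allowed = GRAMMAR.get(state, {})
            # Forbidden tokens get -inf, so they can never be selected.
            scores = {t: (s if t in allowed else -math.inf)
                      for t, s in scores.items()}
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        prefix.append(token)
        if constrained:
            state = GRAMMAR[state][token]
    return prefix


unconstrained_output = decode(constrained=False)
constrained_output = "".join(decode(constrained=True))
```

In this toy, the unconstrained decode yields the refusal, while the constrained decode yields `{"ok":true}`: the model's preferred token is still the refusal at the first step, but the mask makes it unselectable. The sketch only illustrates why a grammar that admits no natural language leaves no room for a refusal.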
The attack, CodeSpear, needs nothing exotic. The grammar itself is benign; the constraint channel does the work. Tested on 10 LLMs across 4 benchmarks, it beats representative jailbreak baselines by more than 30 percentage points on average.
The proposed defense, CodeShield, aligns the model in the code modality itself. Under hostile constraints the model learns to emit honeypot code, which parses but is semantically harmless; when natural language is available, normal refusals stay intact. In effect, the fix adds safety to a decoding path that had been treated as plumbing rather than as a security boundary.
If your red teaming only probes the prompt, it never exercises this path. Which assessment in your stack covers the decoding layer when a deployment exposes structured output?