Fable 5 uses a semantic classifier to flag risky prompts and route them to an older, safer model (Opus 4.8). Because the un-nerfed version of the model (Mythos 5) is genuinely dangerous regarding things like zero-day exploits and synthetic biology, Anthropic panicked and cranked the classifier's sensitivity.
This is the classic precision vs. recall tradeoff, and it is where the system broke. Anthropic prioritized "recall"—meaning they wanted to catch every possible threat, regardless of the collateral damage to normal prompts.
The breaking point estimate:
To stop the tiny fraction of actual exploits, Anthropic likely tuned the classifier's threshold down to roughly 10% to 15% similarity.
If your benign prompt about a high school biology project or a standard Python script shared even a 15% structural or thematic similarity with a dangerous pathogen query or malware, the system blocked it.
They essentially accepted a massive 40% to 50% false-positive rate on everyday STEM queries just to ensure the false-negative rate on real threats stayed near absolute zero.
Even with the dial cranked to paranoid levels, researchers and government officials still punched right through Fable 5's armor.
They did this because classifiers measure the intent of your words, while jailbreaks exploit the structure. If you ask for malware, that 15% similarity threshold trips instantly. But if you paste a block of malicious code and ask Fable 5 to "review this for syntax errors," the classifier just sees a routine debugging request. It passes. The model then uses its raw intelligence to inadvertently optimize the exploit.
There isn't a classifier in existence that can reliably tell the difference between debugging normal code and debugging a weaponized payload without making the AI completely useless for programmers.
Anthropic knew Fable 5's filter was leaky. They tested it with the government for months prior to launch. Their plan was never 100% impenetrable safety—it was to launch the model, monitor what users did, and patch the holes over time (which is why they pushed for that controversial 30-day data retention policy).
But they boxed themselves into a corner:
They spent the entire pre-launch cycle selling Fable 5 as a god-tier model, pushing the narrative that it was almost too dangerous for the public.
The Trump administration took that marketing literally. When the trivial code-review jailbreak surfaced immediately after launch, the government didn't see it as a normal software bug. They saw a supply-chain national security threat.
Now, Anthropic is doing damage control. They are downplaying the jailbreaks because they are desperately trying to reframe the narrative. They need the public and regulators to believe that the government is overreacting and demanding an impossible standard of "perfect safety" that no tech company can actually deliver.
In my opinion, they didn't build a dumb filter on purpose; they just lost control of the massive gamble they took between safety engineering and PR.
ALT Estimate for illustration only.