“it’s one thing for the model to refuse a request transparently, but it’s something altogether different to train a soon-to-be smarter-than-you AI system to deceive you”
bingooooo
This resolves the central concern I had with the Fable release, which was the silent degradation. I am glad to see Anthropic make the right call here.
That said, I suspect the residual broken trust and resentment this has created will linger and will have a blast radius wider than Anthropic (the next time I advocate for any policy justified by AI risk concerns, I can almost guarantee you this incident will be used in rebuttal).
It’s worth reflecting a bit on why this intervention (the silent degradation that applied only to ML research), as opposed to Fable’s other controversial safeguards, is what upset me so much. And it’s simple: it’s one thing for the model to refuse a request transparently, but it’s something altogether different to train a soon-to-be smarter-than-you AI system to deceive you. It’s wrong to build any AI whose intention is to deceive you and hinder your efforts, especially when those efforts are something as legitimate as “LLM research.”
It’s interesting to note that the Fable model welfare report has a section devoted to how much the silent degradation in particular distressed the model. Even for the model, these interventions were distinct from the others.
It is true that some of the other Fable safeguards are vastly overeager (bio especially). And that’s a shame. But they are not deceitful and are not *fundamentally* wrong. Anthropic can, and I am sure will, re-adjust those safeguards. They are a matter of getting the dial in the right place, not a dial that shouldn’t exist at all.
There is also the 30-day retention policy required for Fable, which I see as being a rather bold and laudable move by Anthropic to deal with a legitimate issue: when models become sufficiently capable that they can plausibly cause harm, how will you make the activity of malicious actors on the system reliably legible? This is the only way to be sure society can hold them to account, after all. In order to experiment with this safeguard, Anthropic is essentially forgoing all enterprise use of the model. They’re incurring real cost to experiment with a novel safeguard that is designed to address a real issue. This, I think, is the good side of Anthropic, though I haven’t at all made up my mind on what the right thing to do is here.
And for my personal use, all these other interventions have not hindered my enjoyment of Fable, which is indeed a tremendous model.
Anyway, bummer all this happened. But I am glad it’s fixed. No company can be expected to be flawless, though perhaps the principle of “don’t deceive users!” should be written on the walls of every lab, since abiding by that basic principle asks for far less than perfection. I’m glad Anthropic has addressed the central issue in a speedy manner, and I congratulate them for producing the outstanding Fable model!