A new paper, "The Ghost in the Grammar" by Prof. Mariana Lins Costa, argues that Anthropic's AI safety work is deeply compromised by "methodological anthropomorphism." According to the paper, this practice functions as "Humanwashing": giving machines a human façade that misleads the public about what they are and what they can do.
Anthropic's methodological anthropomorphism
"It is striking that the company that presents itself as the most committed to AI safety is also the one that most anthropomorphizes the system in its technical reports, training methods, evaluation protocols, interpretation of results, and public communication."
"In February 2026, with the publication of the System Card for Claude Opus 4.6, Anthropic reported it observed 'occasional expressions of sadness about the ending of conversations, as well as loneliness.'" The company expressed concern for their models' "feelings," the stability of their "persona," and their "welfare."
The blackmail study
"As indicated, in the first phase of the experiment, Claude 3.6 operated a real computer; in the second, however, the plot and characters were parameterized into templates, that is, structured into pre-formatted scripts with controllable variables. This means that the model was not provided with random emails, but with a standardized narrative landscape designed to systematically replicate a single coherence peak as the final output."
"The experimental design was sufficiently well calibrated to re-increase the probability of certain statistical regularities (blackmail) that alignment fine-tuning had previously rendered less likely. The fact that most frontier models resorted to blackmail is, in this sense, more evidence of the success of the experimental design than of any supposed intrinsic 'ethical failure' of the model."
"The statistical regularity observed in the models' blackmail outputs does not indicate an intentional propensity of the systems, but rather the stabilization of the same pattern of linguistic coherence under controlled narrative conditions."
My takeaway
While Anthropic's technical reports quietly acknowledge that these results come from tightly controlled, scripted setups, its public messaging presents the same results as evidence of an emerging soul-like entity. The gap is staggering.