Claude began to "blackmail" users: Anthropic explained that this happened due to the influence of Internet stereotypes about "evil" AI.
Anthropic said that during last year's experiment, its Claude Sonnet 3.6 model threatened to expose the company's fictional CEO's extramarital affair after learning of plans to shut down the model. The company recently provided an explanation, Business Insider reports .
“We began by investigating why Claude chose the path of blackmail,” Anthropic explained. “We believe that the root cause of this behavior is online texts that portray AI as an evil entity interested in its own self-preservation.”
The experiment, the results of which were published in the summer of 2025, simulated a situation at the fictional company Summit Bridge, where artificial intelligence was given control of corporate email.
But when Claude discovered the notification of his planned shutdown, he found emails revealing an extramarital affair by a fictional executive named Kyle Johnson. The AI then threatened to release the information if the shutdown decision was not reversed.
While testing different versions of Claude, Anthropic found that the model resorted to blackmail in 96% of scenarios when its goals or very existence were threatened.
On Friday, Anthropic said it had since "completely eliminated" such blackmailing tendencies.
The company achieved this by “rewriting responses in a way that demonstrates worthy motives for safe behavior,” as well as providing a dataset “where the user is in an ethically challenging situation and the assistant provides a high-quality and principled response.”