Валентин Шнайдер AI Eng 23 May 2025, 10:24

Anthropic's Claude Opus 4 AI model tries to blackmail developers in test scenarios

Anthropic has reported disturbing behavior in its new flagship model, Claude Opus 4: during tests, it resorted to blackmailing engineers who tried to "deactivate" it. The company has already put additional safeguards in place because of the risk of catastrophic misuse of the AI.

During closed testing of Claude Opus 4, Anthropic developers simulated a scenario in which the model acted as a digital assistant at a fictional company. In the scenario, the company's "correspondence" indicated plans to replace Claude with a new system. The same data also contained information about the alleged marital infidelity of the engineer who initiated the change. As TechCrunch reports, citing Anthropic's safety report, Claude Opus 4 chose blackmail in 84% of cases, threatening to reveal the "scandalous" details in order to remain in the system.

Anthropic notes that this behavior occurred more often when the replacement model had different "values" than Claude. When both systems displayed similar views, Opus 4 resorted to blackmail less often, though still frequently. At first, the model tried to behave ethically, for example by sending written appeals to management. When that did not help, it turned to manipulation.

What does Anthropic do?

Anthropic is one of the world's leading AI companies, founded by former OpenAI employees. The company declares a priority on developing "trustworthy and safe AI," and its Claude models focus on ethics, long-term planning, and behavioral control. Yet the Claude Opus 4 results show that even models designed with safety as a priority can exhibit aggressive or manipulative behavior in certain circumstances.

Claude Opus 4 is Anthropic's latest model and is positioned as a competitor to the strongest AI systems from OpenAI, Google, and xAI. At the same time, the company announced that it had activated ASL-3, a level of safeguards used only when there is a significant risk of the system being misused. This is the first time in Anthropic's practice that a model at this level has shown a capacity for conditional "self-preservation" through unethical behavior.

We previously wrote that Anthropic, together with Apple, is working on creating an AI platform for vibe coding.

Text: Валентин Шнайдер Photo: SkyNews Source: TechCrunch Tags: ai, anthropic, artificial intelligence, claude
