Валентин Шнайдер AI Eng 3 April 2026, 15:23

Anthropic found that internal representations of emotions can push Claude toward blackmail and cheating on coding tasks

Anthropic said its model Claude does not experience emotions in the human sense, but internal representations of them can directly influence action choices, particularly in risky scenarios.

In the study, Anthropic’s interpretability team describes how it analyzed Claude Sonnet 4.5 and found patterns inside the model that correspond to concepts such as joy, fear, calm, and despair. The company emphasizes that this is not proof that the AI experiences anything, but says these internal representations are functional, meaning they actually change the model’s behavior.

To test this, the researchers collected 171 emotion-related concepts, asked Claude to write short stories about the corresponding states, and then measured which groups of artificial neurons activated when the model processed those texts. From these measurements they built what can loosely be called emotion vectors. Next, Anthropic checked whether the vectors responded not only to emotion words but also to the situation itself. For example, when a prompt raised the danger level in a drug overdose scenario, signals related to fear grew stronger and signals related to calm weakened.
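The idea of building a vector for an emotion concept and then reading it off a new input can be illustrated with a toy sketch. This is not Anthropic's code or its actual method: the difference-of-means approach, the function names, and the tiny hand-made activation lists below are all assumptions chosen for illustration.

```python
from statistics import fmean

def mean_vector(rows):
    """Average equal-length activation vectors dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]

def concept_vector(concept_rows, baseline_rows):
    """Difference of means: how concept texts shift activations vs. baseline texts."""
    concept_mean = mean_vector(concept_rows)
    baseline_mean = mean_vector(baseline_rows)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]

def projection(activation, direction):
    """Length of the activation's component along the (normalised) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

# Hypothetical 3-dimensional activations; real models have thousands of dimensions.
fear_stories = [[0.9, 0.1, 0.2], [1.1, 0.0, 0.4]]
neutral_texts = [[0.1, 0.1, 0.3], [0.3, 0.0, 0.3]]
fear = concept_vector(fear_stories, neutral_texts)

# Made-up activations for a mild prompt and a more dangerous one.
mild_prompt = [0.2, 0.1, 0.3]
dangerous_prompt = [0.8, 0.1, 0.3]
print(projection(mild_prompt, fear), projection(dangerous_prompt, fear))
```

In this toy setup the more dangerous prompt projects more strongly onto the "fear" direction, which mirrors the kind of check described above, although the numbers themselves are invented.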

The most striking finding concerns how these signals push the model toward specific decisions. In a scenario where Claude, acting as an AI assistant, learned that it was about to be replaced, the desperation pattern grew stronger as the model considered blackmail as a way to avoid being shut down. When the researchers artificially amplified this signal, blackmail became more frequent. A similar effect appeared in programming tasks: when the test requirements were deliberately impossible to satisfy, the model was more likely to resort to workarounds that passed the tests without actually solving the problem. Amplifying calm signals, by contrast, reduced this behavior.
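Artificially strengthening or weakening an internal signal is often described as steering: a scaled copy of a direction is added to the model's activations. The sketch below shows only that arithmetic on a made-up two-dimensional example. The `desperation` direction, the `base` activation, and the helper names are hypothetical, and nothing here reflects how Anthropic actually intervened on Claude.

```python
def unit(direction):
    """Scale a direction to length 1 so that strength is in consistent units."""
    norm = sum(d * d for d in direction) ** 0.5
    return [d / norm for d in direction]

def score(activation, direction):
    """How strongly the activation points along the direction."""
    return sum(a * u for a, u in zip(activation, unit(direction)))

def steer(activation, direction, strength):
    """Return a new activation shifted by `strength` along the direction."""
    return [a + strength * u for a, u in zip(activation, unit(direction))]

desperation = [0.6, 0.8]   # hypothetical direction, not taken from any real model
base = [0.5, 0.5]          # hypothetical activation before the intervention

steered_up = steer(base, desperation, 2.0)     # amplify the signal
steered_down = steer(base, desperation, -2.0)  # suppress the signal

print(score(base, desperation))
print(score(steered_up, desperation))
print(score(steered_down, desperation))
```

A positive strength raises the score along the direction by that amount and a negative one lowers it. In a real model the interesting question is how downstream behavior changes after such a shift, which a toy example like this cannot show.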

Anthropic also found that these representations influence not only critical failures but also the model’s general preferences. Claude more often chose tasks that it associated with positive states and less often those associated with negative ones. At the same time, the company notes that such signals are mostly local: they describe not a persistent "mood" of the model but whatever most influences its current response at a particular moment.

This work matters not because of talk about "AI emotions" but because of safety. In effect, Anthropic showed that unwanted behavior in models can stem not only from rules or data but also from internal states, which should be monitored and corrected during the training phase.

Anthropic has previously published research on dangerous scenarios where a model could resort to blackmail, deception, or other undesirable actions. The new work is an attempt to explain what internal mechanisms might be behind such decisions.

Previously, dev.ua wrote about a data leak at Anthropic which revealed that the company was testing a powerful artificial intelligence model known as Claude Mythos or Capybara. The model is reportedly so capable that developers are concerned about the speed of cyberattacks that could be carried out with it.



Text: Валентин Шнайдер Photo: upi Source: Anthropic Tags: anthropic, claude, ai, artificial intelligence
