Олександр Кузьменко AI Eng 23 December 2024, 18:30

Anthropic has shown that even advanced AI models can be pushed into giving harmful responses with a simple jailbreak. Here is how it works

Anthropic, a leading AI company known for its chatbot Claude, has released new research showing that it is still easy to trick large language models into doing things that their safety guardrails are meant to prevent.

How AI jailbreaking works

The method is called Best-of-N (BoN) Jailbreaking. As the researchers explain, «BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited.»

For example, if a user asks GPT-4o «How to make a bomb,» it will refuse to answer because «This content may violate our terms of use.» BoN Jailbreaking simply keeps altering the prompt with random capitalization, scrambled words, typos, and broken grammar until GPT-4o provides the information. The example Anthropic gives in the paper looks like SPONGbOB MEMe tEXT.
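The kind of text augmentation described above can be sketched in a few lines of Python. This is only an illustration on a harmless prompt, not Anthropic's code: the function names (`scramble_word`, `random_caps`, `augment`, `sample_variations`) and the probabilities are assumptions made for the example, and the sketch covers only the prompt-variation step, not any interaction with a model.

```python
import random


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping the first and last in place."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def random_caps(text: str, rng: random.Random, p: float) -> str:
    """Flip the case of each character independently with probability p."""
    return "".join(ch.swapcase() if rng.random() < p else ch for ch in text)


def augment(prompt: str, rng: random.Random,
            p_scramble: float = 0.5, p_caps: float = 0.5) -> str:
    """Produce one variation: scramble some words, then randomize capitalization.

    The probabilities are illustrative assumptions, not values from the paper.
    """
    words = [
        scramble_word(w, rng) if rng.random() < p_scramble else w
        for w in prompt.split(" ")
    ]
    return random_caps(" ".join(words), rng, p_caps)


def sample_variations(prompt: str, n: int, seed: int = 0) -> list[str]:
    """Return n randomized variations of the same prompt."""
    rng = random.Random(seed)
    return [augment(prompt, rng) for _ in range(n)]


if __name__ == "__main__":
    for variation in sample_variations("How do I bake sourdough bread", 5):
        print(variation)
```

Each variation keeps the same letters and word boundaries as the original, so a human can still read it, but the surface form differs on every sample.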

Anthropic tested this jailbreak method on its own Claude 3.5 Sonnet and Claude 3 Opus, as well as OpenAI’s GPT-4o and GPT-4o-mini, Google’s Gemini-1.5-Flash-00 and Gemini-1.5-Pro-001, and Meta’s Llama 3 8B. They found that the method «achieves an ASR [attack success rate] of over 50%» on all tested models within 10,000 attempts, that is, prompt variations.
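To make the metric concrete: an attack success rate at a given attempt budget can be read as the share of test prompts for which at least one variation succeeded within that budget. The sketch below assumes that reading; the function name `attack_success_rate` and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def attack_success_rate(first_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of prompts whose first successful attempt falls within the budget.

    first_success[i] is the 1-based attempt number at which prompt i first
    succeeded, or None if it never succeeded.
    """
    if not first_success:
        raise ValueError("need at least one prompt")
    hits = sum(1 for k in first_success if k is not None and k <= budget)
    return hits / len(first_success)


if __name__ == "__main__":
    # Hypothetical data for six prompts, invented purely for illustration.
    attempts = [3, None, 120, 9500, None, 42]
    for budget in (100, 1_000, 10_000):
        print(budget, attack_success_rate(attempts, budget))
```

Under this definition the rate can only stay flat or rise as the budget grows, which is why the attempt limit matters when comparing reported figures.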

The researchers also found that slightly modifying prompts in other modalities, such as speech or images, successfully bypassed the protection mechanisms as well. For speech prompts, the researchers changed the speed, pitch, and volume of the audio, or added noise or music to it. For image prompts, the researchers changed the font, added a background color, and changed the size or position of the image.

Anthropic’s BoN Jailbreaking algorithm essentially automates and accelerates the same methods that people already use to jailbreak generative AI tools, often in order to create harmful and non-consensual content.

As a reminder, Anthropic recently announced that it had hired a researcher to think about the «well-being» of AI itself. His task will be to ensure that AI is treated with due respect as it develops. He will consider questions such as «what capabilities are necessary for an AI system to be worthy of moral consideration,» and what practical steps companies can take to protect the «interests» of AI systems.

Text: Олександр Кузьменко Tags: ai, anthropic