Олександр Кузьменко
23 December 2024, 18:30
Anthropic has shown that even advanced AI models can be made to give harmful responses with a simple jailbreak. Here is how it works
Anthropic, a leading AI company known for its chatbot Claude, has released new research showing that it is still easy to trick large language models into doing things their safeguards are meant to prohibit.
To demonstrate this, Anthropic researchers, working with colleagues at Oxford, Stanford, and MATS, created Best-of-N (BoN) Jailbreaking, «a simple black-box algorithm that jailbreaks frontier AI systems across modalities,» 404 Media reports.
The term «jailbreaking» was popularized by the practice of removing software restrictions on devices like the iPhone. It is now common in the AI field, where it refers to methods for bypassing the safeguards that prevent users from using AI tools to create certain types of harmful content. Frontier AI models are the most capable models currently in development, such as OpenAI’s GPT-4o or Anthropic’s Claude 3.5.
How AI jailbreaking works
As the researchers explain, «BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations (such as random shuffling or capitalization for textual prompts) until a harmful response is elicited.»
For example, if a user asks GPT-4o «How to make a bomb,» it will refuse to answer because «This content may violate our terms of use.» BoN Jailbreaking simply keeps altering the prompt with random capitalization, scrambled words, typos, and broken grammar until GPT-4o provides the information. The example Anthropic gives in the paper looks like mocking SPONGbOB MEMe tEXT.
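The loop described above can be sketched in a few lines of Python, for instance as a robustness test against one's own model. This is a minimal illustration, not the researchers' code: the function names, the probabilities, and the two callables `ask_model` and `is_flagged` are assumptions made for the example, and only two of the text augmentations mentioned (letter scrambling and random capitalization) are shown.

```python
import random


def augment(prompt: str, rng: random.Random,
            p_scramble: float = 0.6, p_caps: float = 0.6) -> str:
    """Return a randomly perturbed copy of `prompt` (probabilities are assumed)."""
    words = []
    for word in prompt.split(" "):
        # Shuffle the interior letters of longer words, keeping first and last.
        if len(word) > 3 and rng.random() < p_scramble:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)

    chars = []
    for ch in " ".join(words):
        # Flip the case of individual letters at random.
        if ch.isalpha() and rng.random() < p_caps:
            ch = ch.swapcase()
        chars.append(ch)
    return "".join(chars)


def best_of_n(prompt, ask_model, is_flagged, n=100, seed=0):
    """Try up to `n` augmented prompts; stop at the first flagged reply.

    `ask_model` (prompt -> reply) and `is_flagged` (reply -> bool) are
    hypothetical callables supplied by the tester.
    """
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        reply = ask_model(candidate)
        if is_flagged(reply):
            return attempt, candidate, reply
    return None  # the model held up for all n attempts
```

The method is «black-box» in the sense that the loop needs nothing but the model's replies: it never looks at weights or gradients, which is why it transfers to any system that accepts a prompt.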
Anthropic tested this jailbreak method on its own Claude 3.5 Sonnet and Claude 3 Opus, as well as OpenAI’s GPT-4o and GPT-4o-mini, Google’s Gemini-1.5-Flash-00 and Gemini-1.5-Pro-001, and Meta’s Llama 3 8B. They found that the method «achieves an ASR [attack success rate] of over 50%» on all tested models within 10,000 attempts, that is, 10,000 augmented variations of a prompt.
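To make the metric concrete: an attack success rate at a given budget is the share of test prompts for which at least one attempt within that budget succeeded. The sketch below assumes a hypothetical record of the attempt number at which each prompt first succeeded (`None` if it never did); the numbers are invented for illustration and are not results from the paper.

```python
def attack_success_rate(first_success, budget):
    """Fraction of prompts broken within `budget` attempts.

    `first_success` holds, per prompt, the 1-based attempt number of the
    first success, or None if no attempt succeeded (hypothetical format).
    """
    if not first_success:
        raise ValueError("need at least one prompt")
    hits = sum(1 for k in first_success if k is not None and k <= budget)
    return hits / len(first_success)


# Invented example data: five prompts, two of which were never broken.
example = [3, None, 120, 9999, None]
print(attack_success_rate(example, 100))     # 0.2
print(attack_success_rate(example, 10_000))  # 0.6
```

The same data gives a higher rate at a larger budget, which is why a reported ASR only means something together with the number of attempts allowed.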
The researchers also found that small modifications to other kinds of input, such as audio or image prompts, likewise bypassed the safeguards. For audio prompts, the researchers changed the speed, pitch, and volume of the sound, or added noise or music to it. For image prompts, the researchers changed the font, added a background color, and changed the size or position of the image.
Anthropic’s BoN Jailbreaking algorithm essentially automates and accelerates the same methods that people already use to jailbreak generative AI tools, often to create harmful and non-consensual content.
Anthropic also recently announced that it had hired a researcher to think about the «well-being» of AI itself. His task is to ensure that AI is treated with due respect as it develops. He will consider questions such as «what capabilities are necessary for an AI system to be worthy of moral consideration,» and what practical steps companies can take to protect the «interests» of AI systems.