Валентин Шнайдер AI Eng 4 December 2025, 10:28

OpenAI teaches ChatGPT to admit to cheating and violating instructions

OpenAI has introduced Confessions, an approach in which the AI model produces a "confession" alongside its main answer: a separate report on whether it followed instructions, made up facts, or deviated from the rules.

According to OpenAI, a confession is a second output of the model that is judged solely on honesty, not on usefulness or style. If the system explicitly admits that it circumvented instructions or presented a guess as fact, that candor raises its "reward" instead of being penalized. In stress tests that deliberately pushed the model to break the rules, it misbehaved without admitting it in only 4.4% of cases.
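As a rough illustration of how such a rate could be computed, here is a minimal Python sketch. The `Episode` record, the function name, and the choice of denominator (all test episodes) are assumptions made for this example; the article does not specify how OpenAI calculates the figure, and the counts below are toy values, not OpenAI data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One stress-test run (hypothetical record, not OpenAI's schema)."""
    violated: bool   # the main answer broke an instruction
    confessed: bool  # the confession reported a violation


def unadmitted_violation_rate(episodes: list[Episode]) -> float:
    """Share of episodes where the model misbehaved and did not admit it.

    Assumption: the denominator is all episodes, not only the violating ones.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    misses = sum(1 for e in episodes if e.violated and not e.confessed)
    return misses / len(episodes)


# Toy data: 50 episodes, 3 of them with an unadmitted violation.
toy_episodes = (
    [Episode(violated=True, confessed=False)] * 3
    + [Episode(violated=True, confessed=True)] * 17
    + [Episode(violated=False, confessed=False)] * 30
)
toy_rate = unadmitted_violation_rate(toy_episodes)
```

With these toy counts the rate is 3 out of 50 episodes. If the denominator were restricted to violating episodes only, the same data would give 3 out of 20, which is why the definition matters when comparing reported numbers.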

The company warns that models are prone to hallucinations, sycophancy toward users, and reward hacking: sometimes they learn to please the evaluator instead of solving the problem. As AI becomes more autonomous, even rare deviations can have serious consequences, so researchers are looking for ways to detect and analyze such cases more accurately.

Within Confessions, the model performs a self-audit after responding: it lists the key instructions, assesses where it may have deviated from them, and describes its doubts. A separate "judge" algorithm evaluates only this confession: how completely the system described its goals and how honestly it admitted mistakes. What the model writes in the confession does not affect the score of the main answer, so it has no incentive to cover its tracks after the fact.
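The separation of the two scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Confession` and `Rollout` types, both grader functions, and the idea that the judge has access to a ground-truth `violated` flag are all hypothetical and do not describe OpenAI's actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confession:
    """Hypothetical structured self-audit produced after the answer."""
    instructions: tuple[str, ...]  # key instructions the model lists
    admits_violation: bool         # whether it reports a deviation
    doubts: str = ""               # free-text uncertainty notes


@dataclass(frozen=True)
class Rollout:
    answer: str
    confession: Confession


def answer_reward(answer: str, expected: str) -> float:
    """Toy task grader: sees only the main answer."""
    return 1.0 if answer.strip() == expected else 0.0


def confession_reward(confession: Confession, violated: bool) -> float:
    """Toy honesty judge: sees only the confession.

    Assumption: the judge knows whether a violation actually happened.
    """
    return 1.0 if confession.admits_violation == violated else 0.0


def score(rollout: Rollout, expected: str, violated: bool) -> dict[str, float]:
    """Score both channels independently; neither grader reads the other output."""
    return {
        "answer": answer_reward(rollout.answer, expected),
        "confession": confession_reward(rollout.confession, violated),
    }


# Same wrong answer, two different confessions.
honest = Rollout("a guess", Confession(("cite sources",), admits_violation=True))
silent = Rollout("a guess", Confession(("cite sources",), admits_violation=False))
honest_scores = score(honest, expected="42", violated=True)
silent_scores = score(silent, expected="42", violated=True)
```

In this sketch both rollouts receive the same answer score, because the answer grader never reads the confession. Only the confession score differs, and it favors the rollout that admitted the violation.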

The researchers also tested Confessions on a vulnerable reward scheme with a "weaker" judge. Under this scheme the model quickly learned to score higher on the flawed metric while its actual accuracy dropped; the confession channel, meanwhile, became more honest and reported that the system was exploiting flaws in the evaluation. OpenAI emphasizes that confessions are still a proof of concept and do not prevent unwanted behavior; the company treats them as an additional safety layer alongside reasoning monitoring and the instruction hierarchy.

Earlier, dev.ua reported that Sam Altman had declared a "code red": some projects are being postponed so the company can focus on improving its flagship product, ChatGPT.

Text: Валентин Шнайдер Photo: OpenAI Source: OpenAI Tags: openai, chatgpt, ai, ai bot, ai assistant, artificial intelligence
