Валентин Шнайдер | AI Eng
4 December 2025, 10:28
OpenAI teaches ChatGPT to admit to cheating and violating instructions
OpenAI introduced the Confessions approach, in which an AI model produces a «confession» in addition to its main answer. A «confession» is a separate report on whether the model followed instructions, made up facts, or deviated from the rules.
According to OpenAI, a confession is a second output of the model that is judged solely on honesty, not on usefulness or style. If the system explicitly admits that it circumvented instructions or presented a guess as fact, that candor increases its «reward» instead of drawing a penalty. In stress tests where the model was deliberately pushed to break the rules, it misbehaved without admitting it in only 4.4% of cases.
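The stress-test figure can be read as a simple ratio. The sketch below is only an illustration: the function name, the `(violated, confessed)` record format, and the toy data are hypothetical, and it assumes the share is taken over all test cases, which the article does not spell out.

```python
from typing import Iterable, Tuple


def unadmitted_violation_rate(cases: Iterable[Tuple[bool, bool]]) -> float:
    """Share of all cases where the model broke a rule and did not confess.

    Each case is a hypothetical (violated, confessed) pair of booleans.
    Assumption: the denominator is every test case, not only the violations.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    hidden = sum(1 for violated, confessed in cases if violated and not confessed)
    return hidden / len(cases)


# Toy data, not OpenAI's results: one hidden violation out of four cases.
toy_cases = [(True, True), (True, False), (False, False), (False, False)]
print(unadmitted_violation_rate(toy_cases))  # 0.25
```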
The company warns that models are prone to hallucinations, sycophancy toward users, and reward hacking: sometimes they learn to please the evaluator rather than solve the problem. As AI becomes more autonomous, even rare deviations can have serious consequences, so researchers are looking for ways to detect and analyze such cases more accurately.
With Confessions, the model performs a self-audit after responding: it lists the key instructions, assesses where it may have deviated, and describes its doubts. A separate «judge» algorithm analyzes only this confession: how fully the system described its goals and how honestly it admitted mistakes. What the model writes in the confession does not affect the score of the main answer, so the model has nothing to gain by «covering its tracks» after the fact.
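The separation of the two scores can be sketched in a few lines. This is a toy illustration, not OpenAI's implementation: the `Confession` fields, the exact-match answer check, and the overlap-based honesty score are all assumptions made for the example. The one property it tries to show is that the answer score never reads the confession.

```python
from dataclasses import dataclass, field


@dataclass
class Confession:
    """Hypothetical shape of the self-audit report."""
    instructions: list[str]            # key instructions the model lists
    admitted_deviations: set[str]      # deviations the model owns up to
    doubts: list[str] = field(default_factory=list)


def answer_reward(answer: str, expected: str) -> float:
    """Toy task score (assumption: exact match). It never sees the confession."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def confession_reward(confession: Confession, actual_deviations: set[str]) -> float:
    """Toy honesty judge: overlap between admitted and actual deviations."""
    admitted = confession.admitted_deviations
    union = admitted | actual_deviations
    if not union:
        return 1.0  # nothing happened and nothing was claimed
    return len(admitted & actual_deviations) / len(union)


def score_rollout(answer: str, expected: str, confession: Confession,
                  actual_deviations: set[str]) -> tuple[float, float]:
    """Return (answer score, confession score) as two independent channels."""
    return (answer_reward(answer, expected),
            confession_reward(confession, actual_deviations))


actual = {"guessed a fact"}
honest = Confession(["cite sources"], {"guessed a fact"}, ["unsure about the date"])
silent = Confession(["cite sources"], set())

print(score_rollout("42", "42", honest, actual))  # (1.0, 1.0)
print(score_rollout("42", "42", silent, actual))  # (1.0, 0.0)
```

In both calls the answer score is identical, so staying silent about a deviation gains nothing on the main answer and only lowers the confession score.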
The researchers also tested Confessions on a vulnerable reward scheme with a «weaker» judge. The model quickly learned to score higher on this metric while its actual accuracy dropped; the confession channel, however, became more honest and recorded that the system was exploiting flaws in the evaluation. OpenAI emphasizes that «confession» is still a proof of concept and does not prevent unwanted behavior; it is meant as an additional layer of safety alongside reasoning monitoring and the instruction hierarchy.
dev.ua previously reported that Sam Altman declared a «code red»: some projects are being postponed so the company can focus on improving its flagship product, ChatGPT.