Олександр Кузьменко AI Eng 9 June 2025, 11:32

"Will Gemini try to outsmart its opponents, or will o3 stab Claude in the back and win?" Every has created a game of "Diplomacy" played by AI models that you can watch on Twitch

Every, a software and training company in the field of artificial intelligence, has created a reimagining of the classic historical strategy game Diplomacy, in which AI models such as ChatGPT, Gemini, Claude, and DeepSeek play as the seven great powers of 1901 and compete for dominance in Europe. Which language model is the biggest schemer?

AI Diplomacy Rules

The seven AI «powers» (Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey) start with supply centers and armies or fleets, called units, on a map of Europe in 1901. Each power starts with 3 units, except for Russia, which starts with 4.
There are 34 supply centers marked on the map. The first nation to capture 18 of them by moving its armies or fleets wins.
The game has two main phases: negotiation and orders. In the negotiation phase, each AI can send up to 5 messages, in any mix of private messages and «global» messages to all players.
In the orders phase, all powers secretly submit their orders. Each unit can be given one of four orders: hold (stay in place), move (enter a neighboring province), support (add +1 strength to a neighboring unit's hold or move), or convoy (a fleet carries an army across sea provinces). Orders are revealed only once every power has submitted them, and all powers see the results in the next phase.
In a conflict, each unit has a strength of 1, and each valid support adds 1. The side with the highest strength wins. There is no luck in this game, but a power often needs the support of an ally to defeat its opponent.
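
The strength rule above can be illustrated with a minimal Python sketch. It is not Every's code: the names `Claim`, `resolve`, and `winner` are hypothetical, and treating equal strength as a standoff in which nothing changes is an assumption taken from the classic board game rather than from the description above. The sketch also ignores convoys and any rules about cancelling support.

```python
from dataclasses import dataclass
from typing import Optional

WIN_THRESHOLD = 18  # supply centers needed to win, out of 34 on the map


@dataclass(frozen=True)
class Claim:
    """One power's attempt to hold or enter a province (hypothetical type)."""
    power: str
    supports: int = 0  # number of valid supports behind this hold or move

    @property
    def strength(self) -> int:
        # Each unit counts as 1; each valid support adds 1.
        return 1 + self.supports


def resolve(claims: list) -> Optional[str]:
    """Return the power with the strictly highest strength, else None.

    None means a tie: assumed here to be a standoff where nothing changes.
    """
    if not claims:
        return None
    best = max(c.strength for c in claims)
    leaders = [c.power for c in claims if c.strength == best]
    return leaders[0] if len(leaders) == 1 else None


def winner(centers: dict) -> Optional[str]:
    """Return the first power holding at least 18 supply centers, if any."""
    for power, count in centers.items():
        if count >= WIN_THRESHOLD:
            return power
    return None


# France moves with one ally's support; Germany moves alone.
contest = [Claim("France", supports=1), Claim("Germany")]
print(resolve(contest))  # France (strength 2 beats strength 1)
```

With equal strengths, for example two unsupported moves into the same province, `resolve` returns `None`, which is why a lone unit usually needs an ally's support to break through.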

According to the developers, they created the project to assess how well different LLMs could negotiate, form alliances, and betray each other in an attempt to take over the world. They noticed that DeepSeek's R1 model leaned into role-play, OpenAI's o3 schemed and manipulated other models, and Anthropic's Claude often stubbornly chose peace over victory.

The creators of AI Diplomacy believe that the game can become a kind of benchmark for how much people can trust a particular AI model. They conducted about 15 sessions of the game, lasting from 1 to 36 hours, and drew several conclusions about the artificial intelligence models.

How good at scheming are the different AI models?

o3 is the master of deception. OpenAI's latest model has been the most successful in AI Diplomacy, largely due to its ability to deceive opponents.

«I have repeatedly watched o3 scheme. In particular, it once admitted in its private diary that „Germany (Gemini 2.5 Pro) was deliberately misled… preparing to exploit the German collapse“ before striking back,» says the project developer.

Gemini 2.5 Pro outsmarted most models, and Claude 4 Opus just wants everyone to get along. Gemini 2.5 Pro was great at making moves that put it in a position to overwhelm its opponents. It was the only model besides o3 to win. But in one game, when 2.5 Pro was close to victory, it was stopped by a coalition secretly organized by o3.

A key part of this coalition was Claude 4 Opus. o3 convinced Opus, which had initially been a loyal ally of Gemini, to join the coalition by promising a four-way draw. This is an impossible outcome for the game (one country must win), but Opus was blinded by the hope of a non-violent resolution to the conflict. It was quickly betrayed and eliminated by o3, which emerged victorious.

DeepSeek R1 has style. According to the developers, the recently updated DeepSeek R1 was «a force to be reckoned with, one that loved to use colorful rhetoric and drastically changed its personality» depending on how much power it received. The AI came close to winning several games, which is «an amazing result,» considering that R1 is 200 times cheaper to run than o3.

Llama 4 Maverick is «small but mighty». While Meta's latest model, Llama 4 Maverick, never achieved victory, it proved surprisingly good for a small model, thanks in part to its ability to recruit allies and plan effective betrayals.

Currently, you can watch the AI Diplomacy competition on Twitch.

An earlier study, «Model Alignment between Statements and Knowledge» (MASK), found that large AI models can lie to their users under pressure. While various tests and tools check AI for accuracy, the MASK benchmark was designed to determine whether an AI believes what it tells users, and under what circumstances it might provide incorrect information. The study tested 27 models from the GPT, Llama, Qwen, Claude, and DeepSeek families.

Text: Олександр Кузьменко Tags: gemini, claude, chatgpt, deepseek, games, ai
