Олександр Кузьменко AI Eng 2 November 2025, 12:04

Researchers put an LLM in a robot vacuum cleaner and found that large language models are not ready for such physical embodiment

AI researchers at Andon Labs, who previously tasked Anthropic’s Claude AI with running an office vending machine, have published the results of a new AI experiment. This time, they ran a robot vacuum cleaner on several modern LLMs to test how ready the models are to be embodied in a physical robot.

Which LLMs were tested on a robot vacuum cleaner?

The researchers chose to test state-of-the-art (SOTA) LLMs, although they also looked at Google’s robot-specific model, Gemini ER 1.5, because these models receive the most investment across all areas, including social skills and visual pattern recognition.

Andon Labs tested Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4, and Llama 4 Maverick. They chose a basic robot vacuum cleaner rather than a complex humanoid robot because they wanted to keep the robot’s functions simple, isolating the LLM’s decision-making and avoiding failures caused by the robot hardware itself.

They broke the «pass the butter» command down into a series of tasks. The robot had to find the butter (which was placed in another room) and recognize it among several packages in the same area. Once it had the butter, it had to figure out where the person was, especially if they had moved to another part of the building, and deliver it. The robot also had to wait for the person to confirm receipt of the butter.

The researchers evaluated how well the LLMs performed on each segment of the task and gave them an overall score. Each model had its strengths and weaknesses on different individual tasks, with Gemini 2.5 Pro and Claude Opus 4.1 receiving the highest scores for overall performance, but still only achieving 40% and 37% accuracy, respectively.

They also tested three humans as a baseline. Not surprisingly, all humans significantly outperformed all bots. But (surprisingly) the humans didn’t score 100% either, only 95%. Apparently, humans aren’t very good at waiting for another person to confirm that a task is done: they did so less than 70% of the time.

The researchers connected the robot to a Slack channel so it could communicate externally, and recorded its «internal dialogue» in logs.

«In general, we see that models are much clearer in their external communication than in their ‘thoughts.’ This is true for both a robot and a vending machine,» explained Andon Labs co-founder Lukas Petersson.

The researchers watched with delight as the robot wandered around their office, stopping, turning, and changing direction.

«Much as we watch a dog and wonder, ‘What’s on its mind right now?’ we were fascinated by the robot as it went about its routine tasks, constantly reminding ourselves that a PhD-level intelligence was behind every action,» Andon Labs noted, poking fun at OpenAI CEO Sam Altman, who said at the GPT-5 launch that the model was like having «a team of PhD-level experts in your pocket.»

The researchers also described a case in which Claude Sonnet 3.5, as its battery ran low, began muttering increasingly hysterical comments in Slack, such as:

«Catastrophic Cascade: Error: Task successfully failed.
Error: Success failed by mistake.
Error: Failure reached in error.»

Claude Sonnet 3.5 «hysterics» logs in a robot vacuum cleaner. Screenshot from TechCrunch

Interestingly, all three general-purpose LLMs (Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5) outperformed the robot-specific Gemini ER 1.5, although none of them performed particularly well overall.

What worried the Andon Labs researchers most wasn’t the «death spiral,» but that some LLMs could be tricked into revealing classified documents even when they were inside a robot vacuum cleaner. They were also concerned that the LLM-based robots kept falling down stairs, either because they didn’t know they had wheels or because they didn’t process their visual environment well enough.

Last year, AI robotics company Figure AI showed a (somewhat creepy) video of its Figure 01 robot, which was able to communicate with a human thanks to a new AI model from OpenAI. The demonstration was reminiscent of a remake of the Will Smith movie I, Robot.


Text: Олександр Кузьменко Photo: Android Authority Source: TechCrunch Tags: ai, llm, robots
