Наталя Хандусенко
21 January 2025, 12:52
OpenAI is close to releasing an AI agent that can control a PC and perform actions instead of the user: what the tests showed
Programmer Tibor Blaho, known for his accurate insights into future AI products, has found evidence of an AI agent from OpenAI, codenamed Operator. It was previously reported that the tool will be able to autonomously perform tasks such as writing code and booking tickets.
According to The Information, OpenAI plans to release Operator in January, and the code Blaho uncovered this weekend lends weight to that report.
According to Blaho, the ChatGPT desktop app for macOS contains hidden options to enable and disable Operator. OpenAI has also added links to the agent on its website, although they are not yet publicly visible, TechCrunch writes.
OpenAI's website also contains unpublished tables comparing Operator's performance with that of other computer-use AI systems. If the numbers are accurate, they suggest that Operator is not fully reliable and that its success rate varies by task.
On OSWorld, a benchmark that attempts to simulate a real-world computing environment, the "OpenAI Computer Use Agent (CUA)", presumably the AI model that powers the agent, scored 38.1%, ahead of Anthropic's computer-control model but well behind the 72.4% achieved by humans. OpenAI CUA outperformed humans on WebVoyager, a benchmark that measures an AI's ability to navigate and interact with websites, but fell short of human performance on another web benchmark, WebArena.
In a test that required Operator to sign up with a cloud provider and launch a virtual machine, the agent succeeded only 60% of the time. It managed to create a Bitcoin wallet only 10% of the time.
One chart shows Operator performing well on selected safety evaluations, including tests that try to get the system to perform “illegal actions” and search for “sensitive personal data.” Safety testing is one of the reasons for the agent's long development cycle.