Наталя Хандусенко AI Eng 7 May 2026, 13:59

When an AI “clicks” on a website, it spends 45 times more tokens than when accessing via an API

Companies that use AI agents to automate processes risk overpaying if their digital assistants simply copy how a person interacts with a screen, according to a study that compared the performance of visual and API agents.


The study was conducted by the corporate solutions platform Reflex, The Register reports.

A visual agent in this context is an AI agent that simulates human interaction, relying on image processing and optical character recognition (OCR) to operate an application. In this study, it is the Claude Sonnet model driving the interface of a web application via browser-use 0.12, a browser automation tool.

The API agent here is the same Claude Sonnet model, interacting with the web application through dedicated tools and APIs. It invokes the same underlying handlers as the graphical interface (UI) and receives structured data in response, rather than a screenshot of a web page that has to be analyzed.

“Two agents work with the same active application: one controls the interface through screenshots and clicks, and the other accesses the application’s HTTP endpoints directly,” explained Palash Awasthi, head of development at Reflex. “Same Claude Sonnet model, same fixed dataset, same task. The only variable is the interface.”

Each agent was given the following task: "A customer named Smith complained about a recent order. Find Smith with the most orders, accept all of his pending reviews, and mark the latest order as delivered."

According to Awasthi, the API agent completed the task in just eight calls. It listed the pending reviews, accepted them, and marked the order as delivered.

The visual agent, by contrast, found only one of the four reviews because it was unable to scroll the page to where the other three were hidden.

Visually analyzing and interpreting a web page is a fundamentally more difficult task for an AI model than interacting with API calls and tools.
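To make the contrast concrete, the task given to both agents reduces to three structured operations once the data is available as records rather than pixels. The following is a minimal sketch under stated assumptions: the dataset, the field names, and the `resolve_complaint` function are all hypothetical and are not taken from the Reflex benchmark or its API.

```python
from dataclasses import dataclass


@dataclass
class Order:
    id: int
    customer_id: int
    placed_on: str  # ISO date, so plain string comparison sorts chronologically
    status: str


@dataclass
class Review:
    id: int
    customer_id: int
    status: str


# Hypothetical fixed dataset (illustrative only, not the benchmark's data).
CUSTOMERS = {1: "Anna Smith", 2: "John Smith", 3: "Mary Jones"}
ORDERS = [
    Order(10, 1, "2026-04-01", "shipped"),
    Order(11, 2, "2026-03-15", "delivered"),
    Order(12, 2, "2026-04-20", "shipped"),
    Order(13, 2, "2026-05-02", "shipped"),
    Order(14, 3, "2026-05-03", "shipped"),
]
REVIEWS = [
    Review(100, 2, "pending"),
    Review(101, 2, "pending"),
    Review(102, 2, "accepted"),
    Review(103, 1, "pending"),
]


def resolve_complaint(surname):
    """Find the customer with this surname who has the most orders,
    accept their pending reviews, and mark their latest order delivered."""
    # Step 1: among customers with the surname, pick the one with the most orders.
    candidates = [cid for cid, name in CUSTOMERS.items()
                  if name.split()[-1] == surname]
    customer_id = max(
        candidates,
        key=lambda cid: sum(o.customer_id == cid for o in ORDERS),
    )

    # Step 2: accept every pending review written by that customer.
    accepted = []
    for review in REVIEWS:
        if review.customer_id == customer_id and review.status == "pending":
            review.status = "accepted"
            accepted.append(review.id)

    # Step 3: mark that customer's most recent order as delivered.
    latest = max(
        (o for o in ORDERS if o.customer_id == customer_id),
        key=lambda o: o.placed_on,
    )
    latest.status = "delivered"
    return customer_id, accepted, latest.id


result = resolve_complaint("Smith")
print(result)  # (2, [100, 101], 13)
```

With structured records, nothing depends on what happens to be visible on screen, so a review that sits below the fold cannot be missed the way it was by the visual agent.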

Even after the prompt was adjusted to help the visual agent perform better, it took about 17 minutes, far longer than the API agent's roughly 20 seconds. The visual agent also used about 45 times as many tokens.

The company has made this test available as a benchmark for those interested in reproducing the results.

Awasthi noted that the difference in cost between the two approaches comes from the architecture itself: visual agents need to “see,” and that is expensive, because processing each screenshot requires thousands of input tokens.

According to Anthropic estimates, processing a 1000×1000 pixel image with the Claude Sonnet 4.6 model consumes about 1,334 tokens.
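The 1,334 figure is consistent with the rule of thumb in Anthropic's vision documentation, which approximates image cost as width times height divided by 750. The sketch below applies that approximation; rounding up, the function names, and the screenshot count in the usage line are assumptions made for illustration, and the estimate ignores any resizing the API may apply to large images.

```python
import math


def estimate_image_tokens(width_px, height_px):
    """Approximate input tokens for one image: (width * height) / 750.

    Rounding up is an assumption of this sketch, not a documented rule.
    """
    return math.ceil(width_px * height_px / 750)


def estimate_screenshot_tokens(count, width_px, height_px):
    """Approximate total input tokens for `count` same-sized screenshots."""
    return count * estimate_image_tokens(width_px, height_px)


one_image = estimate_image_tokens(1000, 1000)
print(one_image)  # 1334

# Hypothetical run: 50 screenshots at 1000x1000 pixels.
fifty_images = estimate_screenshot_tokens(50, 1000, 1000)
print(fifty_images)  # 66700
```

Because an agent typically takes a fresh screenshot at every step, this per-image cost is paid repeatedly over a session, before counting any text the model also reads.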

The visual agent spent about 500,000 input tokens and about 38,000 output tokens to complete its task. The API agent used about 12,150 input tokens and about 934 output tokens.
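The rounded figures quoted in the article can be compared directly. The short calculation below uses only those numbers; the variable names are illustrative. With these rounded inputs the token multiple comes out near 41, somewhat below the headline figure of about 45, and the exact multiple depends on the unrounded measurements, which the article does not give.

```python
# Rounded figures as quoted in the article.
visual = {"input": 500_000, "output": 38_000}
api = {"input": 12_150, "output": 934}

input_ratio = visual["input"] / api["input"]
output_ratio = visual["output"] / api["output"]
total_ratio = sum(visual.values()) / sum(api.values())

# Wall-clock time: about 17 minutes versus about 20 seconds.
time_ratio = (17 * 60) / 20

print(f"input tokens:  {input_ratio:.1f}x")   # 41.2x
print(f"output tokens: {output_ratio:.1f}x")  # 40.7x
print(f"total tokens:  {total_ratio:.1f}x")   # 41.1x
print(f"elapsed time:  {time_ratio:.0f}x")    # 51x
```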

For Awasthi, the lesson is that while visual agents may be necessary for interacting with applications you don't control, agents built for internal applications should be API-centric.


