Наталя Хандусенко AI Eng 2 April 2025, 18:10

“Our content is free, our infrastructure is not.” Wikipedia fights voracious scraper bots that collect information to train AI models

Since January 2024, Wikimedia has seen a 50% increase in the bandwidth used to download multimedia content. That is not because people are reading more Wikipedia articles, watching more videos, or downloading more files from Wikimedia Commons. The surge comes from scraper bots: automated programs that crawl images, videos, articles, and other openly licensed Wikimedia files to train generative artificial intelligence models.

Such a sudden increase in bot traffic can slow down access to Wikimedia pages and resources, especially during high-interest events, writes Engadget.

For example, when Jimmy Carter died in December 2024, the increased interest in a video of his presidential debate with Ronald Reagan caused pages to load slowly for some users. Wikimedia can handle spikes in traffic from human readers during such events. But “the volume of traffic generated by scraper bots is unprecedented and poses increasing risks and costs,” Wikimedia said.

Human readers tend to look up specific and often similar topics: when something is popular, many people request the same pages. Wikimedia caches content that has been requested multiple times in the data center closest to the user, which allows it to be served faster. Articles and files that have not been accessed for some time must instead be served from the main data center, which consumes more resources and therefore costs Wikimedia more money. Because AI crawlers tend to read pages in bulk, they hit lesser-known pages that are not cached and have to be served from the main data center.
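The caching effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of Wikimedia's actual infrastructure: the `EdgeCache` class, the least-recently-used eviction policy, the cache size, and the traffic mix (90% of human requests going to 50 popular pages) are all hypothetical values chosen for illustration.

```python
import random
from collections import OrderedDict


class EdgeCache:
    """Hypothetical LRU cache standing in for a regional caching data center."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def request(self, page_id):
        if page_id in self.pages:
            # Cache hit: served locally, mark the page as recently used.
            self.pages.move_to_end(page_id)
            self.hits += 1
        else:
            # Cache miss: the page would be fetched from the main data center.
            self.misses += 1
            self.pages[page_id] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def simulate(requests, capacity=100):
    """Replay a list of page ids through a fresh cache and return its miss rate."""
    cache = EdgeCache(capacity)
    for page_id in requests:
        cache.request(page_id)
    return cache.miss_rate()


rng = random.Random(0)  # fixed seed so the run is repeatable
TOTAL_PAGES = 10_000

# Assumed human pattern: 90% of requests go to 50 popular pages.
human_requests = [
    rng.randrange(50) if rng.random() < 0.9 else rng.randrange(TOTAL_PAGES)
    for _ in range(5_000)
]

# Assumed crawler pattern: 5,000 different pages, each requested once.
crawler_requests = list(range(5_000))

human_miss_rate = simulate(human_requests)
crawler_miss_rate = simulate(crawler_requests)

print(f"human miss rate:   {human_miss_rate:.2f}")
print(f"crawler miss rate: {crawler_miss_rate:.2f}")
```

In this toy setup the crawler never requests the same page twice, so every one of its requests is a miss that falls through to the main data center, while the human traffic is mostly absorbed by the cache.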

Wikimedia reported that, on closer inspection, 65% of its most resource-intensive traffic comes from bots. This is already causing constant disruption for its Site Reliability team, which has to keep blocking crawlers before they significantly slow down page access for real readers.

The real problem, Wikimedia says, is that "expansion has occurred largely without sufficient attribution, which is a key factor in attracting new users."

Wikimedia, which relies on donations from people to keep it going, needs to attract new users and make them care about its cause.

“Our content is free, our infrastructure is not,” Wikimedia said. In its next fiscal year, the organization plans to establish sustainable ways for developers and reusers to access its content. This is necessary because it sees no sign that AI-related traffic will slow down anytime soon.
