Олександр Кузьменко AI Eng 17 April 2025, 13:48

Wikipedia has released a special dataset to discourage bots that collect information for AI training

Wikipedia has released a set of structured data specifically optimized for training artificial intelligence models, in an attempt to discourage AI developers from scraping the site with bots that burden its servers.

The Wikimedia Foundation has announced a partnership with Kaggle, a Google-owned data science community platform that hosts machine learning datasets. Together they have released a beta version of a dataset of "Wikipedia structured content in English and French," The Verge reports.

According to Wikimedia, the dataset is "designed with machine learning workflows in mind," making it easier for AI developers to access article data for modeling, fine-tuning, benchmarking, and more. The content of the dataset is openly licensed and currently includes research summaries, short descriptions, image links, infobox data, and article sections (excluding references and non-text elements such as audio files).

Well-structured JSON representations of Wikipedia content, available to Kaggle users, should become a more attractive alternative to "mining or parsing the raw text of an article," Wikimedia says. Such scraping currently puts a strain on Wikipedia's servers, as automated AI bots relentlessly consume the platform's bandwidth.
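To illustrate why structured JSON is easier to work with than scraped page text, here is a minimal Python sketch. It rests on assumptions that the article does not confirm: that the data arrive as one JSON object per line, and that records carry keys named `name`, `description`, `abstract`, and `sections`. The sample record is made up for the example and is not taken from the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed article record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: dict) -> dict:
    """Pick out a few fields; the key names are assumed, not confirmed."""
    return {
        "title": record.get("name", ""),
        "description": record.get("description", ""),
        "abstract": record.get("abstract", ""),
        "section_count": len(record.get("sections", [])),
    }


# Made-up sample standing in for the lines of a downloaded file.
SAMPLE_LINES = [
    json.dumps({
        "name": "Example article",
        "description": "A placeholder short description",
        "abstract": "A placeholder abstract.",
        "sections": [{"name": "History"}, {"name": "Usage"}],
    }),
    "",  # blank lines are skipped
]

summaries = [summarize(r) for r in iter_articles(SAMPLE_LINES)]
print(summaries)
```

With a real download, the same functions could be fed an open file handle instead of `SAMPLE_LINES`, once the actual field names have been checked against the dataset's documentation.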

Since January 2024, the bandwidth used to download multimedia content from Wikimedia has grown by 50%. According to the foundation, up to 65% of the resource-intensive traffic it receives comes from bots collecting data for AI. This is already making it harder to keep the site reliable: the project's engineering team has to constantly block crawlers before they significantly slow down access to pages for real readers.

"Our content is free, our infrastructure is not," Wikimedia noted.

The foundation already has content sharing agreements with Google and the Internet Archive, but the partnership with Kaggle should make this data more accessible to smaller companies and independent data scientists.

"As the go-to place for the machine learning community to find tools and tests, Kaggle is extremely excited to host the Wikimedia Foundation data," said Brenda Flynn, Kaggle's head of partnerships.

Text: Олександр Кузьменко Photo: Otio Source: The Verge Tags: ai, wikipedia
