
Cloudflare explained the cause of the failure: one configuration file "broke" the network around the world

On November 18, a Cloudflare network outage knocked out thousands of websites around the world and in Ukraine for several hours. Now the company has released a report and admitted that it was not a hacker attack, but an internal bug.


According to Cloudflare's technical report, the incident started at 11:20 UTC. A change to access permissions in the ClickHouse database, which generates the configuration file for the Bot Management system, caused that file to include twice as many bot "features" as intended. The file exceeded its size limit, so the module that checks bot traffic could not process it and began returning 5xx errors to users en masse instead of the requested sites.
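The failure mode described above can be illustrated with a minimal sketch. It is not Cloudflare's code: the limit value, the names `MAX_FEATURES`, `validate_features` and `load_or_keep`, and the sample data are all assumptions made for illustration. It shows a file whose entries were duplicated exceeding a hard limit, and a loader that keeps the last known-good file rather than failing.

```python
MAX_FEATURES = 200  # hypothetical limit, chosen only for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file lists more entries than the module allows."""


def validate_features(features):
    """Return the feature list unchanged, or raise if it exceeds the limit."""
    if len(features) > MAX_FEATURES:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {MAX_FEATURES}"
        )
    return features


def load_or_keep(current, candidate):
    """Adopt the candidate file if it is valid; otherwise keep the current one."""
    try:
        return validate_features(candidate)
    except FeatureFileTooLarge:
        # Reject the oversized file and keep serving with the old configuration.
        return current


good = [f"feature_{i}" for i in range(120)]
doubled = good + good  # every entry appears twice, so the file doubles in size

active = load_or_keep(good, doubled)
```

In this sketch the oversized file is rejected and `active` remains the previous, valid list, so a bad file degrades freshness rather than availability.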

Because this file is regenerated every few minutes and distributed to all servers almost instantly, the failure quickly became global. Some data centers received the correct version and others the "broken" one, so the network alternated between failing and partially recovering. Only at 14:30 UTC did engineers stop the distribution of the incorrect file, replace it with a stable version and restart the proxy; fully normal network operation was restored at 17:06 UTC.

Cloudflare emphasizes that the incident was not a DDoS attack. The team initially suspected one: the day before, the company had repelled record attacks by the Aisuru botnet, and during the failure the independently hosted status site also stopped loading. However, analysis of the logs showed that the "culprit" was an internal configuration file that was generated incorrectly after the access settings were changed.

The outage affected not only the main CDN but also other products: Turnstile (the human-versus-bot check) did not work, logging in to the control panel was broken, and error rates rose in Workers KV and the Access service. After replacing the file, the company spent several hours restarting modules and clearing the backlog of stuck requests, which was causing some internal APIs to fail.

Cloudflare has already promised to tighten system configuration checks, add a "kill switch" for risky features, and prevent logging services from consuming all server resources during errors. The company's CEO called the outage the worst since 2019 and apologized to customers for the fact that the network was temporarily unable to perform its basic function of routing traffic.
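A "kill switch" of the kind mentioned above can be sketched roughly as follows. This is an illustrative assumption about how such a switch might behave, not a description of Cloudflare's design; `Response`, `handle_request`, `broken_scorer` and the `bot_check_enabled` flag are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    status: int
    bot_score: Optional[int] = None  # None means the request was not scored


def handle_request(score_fn: Callable[[], int], bot_check_enabled: bool) -> Response:
    """Serve a request, scoring it only while the kill switch leaves the check on."""
    if not bot_check_enabled:
        # Kill switch thrown: skip the risky module and serve without a score.
        return Response(status=200)
    try:
        return Response(status=200, bot_score=score_fn())
    except Exception:
        # With the check enabled, a failing module surfaces as a server error.
        return Response(status=500)


def broken_scorer() -> int:
    """Stand-in for a scoring module that cannot load its configuration."""
    raise RuntimeError("feature file could not be loaded")


with_check = handle_request(broken_scorer, bot_check_enabled=True)
without_check = handle_request(broken_scorer, bot_check_enabled=False)
```

The trade-off in this sketch is that requests go unscored while the switch is off, which is accepted here in exchange for continuing to serve traffic.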

Earlier, dev.ua reported that the global Cloudflare outage disrupted fiscalization services in Ukraine, specifically the software-based cash registers (PRRO) used by many Ukrainian businesses.

On the topic
Ukrainian Ucloud does not rule out that a large-scale failure at Cloudflare could have occurred due to the "human factor"
UPD. Cloudflare said the issue was resolved. The company attributed the outage to an "unusual" spike in traffic.
"Everyone puts all their eggs in one basket and then they're surprised when there's a problem." Catchpoint CEO urges companies to take better care of reliability after Cloudflare outage