AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders
Updated Cloud services giant Fastly has released a report claiming AI crawlers are putting a heavy load on the open web, slurping up sites at a rate that accounts for 80 percent of all AI bot traffic, with AI fetchers making up the remaining 20 percent. Both crawlers and fetchers can hit websites hard, demanding data from a single site in thousands of requests per minute.
I can only see one thing causing this to stop: the AI bubble popping
According to the report [PDF], Facebook owner Meta’s AI division accounts for more than half of those crawlers, while OpenAI accounts for the overwhelming majority of on-demand fetch requests.
“AI bots are reshaping how the internet is accessed and experienced, introducing new complexities for digital platforms,” Fastly senior security researcher Arun Kumar opined in a statement on the report’s release. “Whether scraping for training data or delivering real-time responses, these bots create new challenges for visibility, control, and cost. You can’t secure what you can’t see, and without clear verification standards, AI-driven automation risks are becoming a blind spot for digital teams.”
The company’s report is based on analysis of Fastly’s Next-Gen Web Application Firewall (NGWAF) and Bot Management services, which the company says “protect over 130,000 applications and APIs and inspect more than 6.5 trillion requests per month” – giving it plenty of data to play with. The data reveals a growing problem: an increasing website load comes not from human visitors, but from automated crawlers and fetchers working on behalf of chatbot firms.
“Some AI bots, if not carefully engineered, can inadvertently impose an unsustainable load on webservers,” Fastly’s report warned, “leading to performance degradation, service disruption, and increased operational costs.” Kumar separately noted to The Register, “Clearly this growth isn’t sustainable, creating operational challenges while also undermining the business model of content creators. We as an industry need to do more to establish responsible norms and standards for crawling that allows AI companies to get the data they need while respecting websites content guidelines.”
That growing traffic comes from just a select few companies. Meta accounted for more than half of all AI crawler traffic on its own, at 52 percent, followed by Google and OpenAI at 23 percent and 20 percent respectively. This trio therefore has its hands on a combined 95 percent of all AI crawler traffic. Anthropic, by contrast, accounted for just 3.76 percent of crawler traffic. The Common Crawl Project, which slurps websites into a free public dataset intended to prevent exactly the duplication of effort and traffic multiplication at the heart of the crawler problem, came in at a surprisingly low 0.21 percent.
The story flips when it comes to AI fetchers, which unlike crawlers are fired off on-demand when a user requests that a model incorporates information newer than its training cut-off date. Here, OpenAI was by far the dominant traffic source, Fastly found, accounting for almost 98 percent of all requests. That’s an indication, perhaps, of just how much of a lead OpenAI’s early entry into the consumer-facing AI chatbot market with ChatGPT gave the company, or possibly just a sign that the company’s bot infrastructure may be in need of optimization.
While AI fetchers make up a minority of AI bot requests – only about 20 percent, says Kumar – they can be responsible for huge bursts of traffic, with one fetcher generating over 39,000 requests per minute during the testing period. “We expect fetcher traffic to grow as AI tools become more widely adopted and as more agentic tools come into use that mediate the experience between people and websites,” Kumar told The Register.
Perplexity AI, which was recently accused of using IP addresses outside its reported crawler ranges and ignoring robots.txt directives from sites looking to opt out of being scraped, accounted for just 1.12 percent of AI crawler traffic and 1.53 percent of AI fetcher traffic recorded for the report – though the report noted that its share is growing.
Kumar decried the practice of ignoring robots.txt notes, telling El Reg, “At a minimum, any reputable AI company today should be honoring robots.txt. Further and even more critically, they should publish their IP address ranges and their bots should use unique names. This will empower site operators to better distinguish the bots crawling their sites and allow them to enforce granular rules with bot management solutions.”
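For site operators who want to act on that advice, opting out of the well-behaved crawlers takes only a few lines of robots.txt. The sample below is a minimal sketch: the user-agent tokens shown (GPTBot, ClaudeBot, Google-Extended, CCBot) are the ones the respective vendors document for their crawlers at the time of writing, so operators should check each company's own documentation for current names, and remember that, as Kumar's point implies, this only deters bots that choose to honor the file.

    # Opt documented AI training crawlers out of the whole site.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Leave ordinary crawling alone for everyone else.
    User-agent: *
    Allow: /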
But he stopped short of calling for mandated standards, saying that industry forums are working on solutions. “We need to let those processes play out. Mandating technical standards in regulatory frameworks often does not produce a good outcome and shouldn’t be our first resort.”
It’s a problem large enough that users have begun fighting back. In the face of bots riding roughshod over polite opt-outs like robots.txt directives, webmasters are increasingly turning to active countermeasures like the proof-of-work Anubis or gibberish-feeding tarpit Nepenthes, while Fastly rival Cloudflare has been testing a pay-per-crawl approach to put a financial burden on the bot operators. “Care must be exercised when employing these techniques,” Fastly’s report warned, “to avoid accidentally blocking legitimate users or downgrading their experience.”
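As for how a proof-of-work gate such as Anubis makes that traffic expensive, the sketch below is a minimal, hypothetical illustration of the general hashcash-style idea in Python. It is not Anubis's actual code, and the difficulty value and function names are our own: the server issues a random challenge, and the client must grind through nonces until it finds one whose SHA-256 hash starts with enough zero bits before it gets the page.

    import hashlib
    import os

    DIFFICULTY_BITS = 18  # illustrative; each extra bit roughly doubles the client's work

    def issue_challenge() -> str:
        # Server side: hand the visitor a random challenge string.
        return os.urandom(16).hex()

    def leading_zero_bits(digest: bytes) -> int:
        # Count how many zero bits the digest starts with.
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def solve(challenge: str) -> int:
        # Client side: grind nonces until the hash clears the difficulty bar.
        nonce = 0
        while leading_zero_bits(hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()) < DIFFICULTY_BITS:
            nonce += 1
        return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: checking an answer costs a single hash.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS

    challenge = issue_challenge()
    nonce = solve(challenge)   # trivial for one human page view, painful at scraper volume
    assert verify(challenge, nonce)

The asymmetry is the point: verification costs the server a single hash, while the solver has to compute on the order of 2^18 of them, which is negligible for a person loading one page but a real compute bill for a bot hammering millions of URLs.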
Kumar noted that small site operators, especially those serving dynamic content, are likely to feel the effects most severely, and he had some recommendations. “The first and simplest step is to configure robots.txt which immediately reduces traffic from well-behaved bots. When technical expertise is available, websites can also deploy controls such as Anubis, which can help reduce bot traffic.” He warned, however, that bots are always improving and trying to find ways around countermeasures like Anubis, as code-hosting site Codeberg recently experienced. “This creates a constant cat and mouse game, similar to what we observe with other types of bots today,” he said.
We spoke to Anubis developer Xe Iaso, CEO of Techaro. When we asked whether they expected the growth in crawler traffic to slow, they said: “I can only see one thing causing this to stop: the AI bubble popping.
“There is simply too much hype to give people worse versions of documents, emails, and websites otherwise. I don’t know what this actually gives people, but our industry takes great pride in doing this.”
However, they added: “I see no reason why it would not grow. People are using these tools to replace knowledge and gaining skills. There’s no reason to assume that this attack against our cultural sense of thrift will not continue. This is the perfect attack against middle-management: unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees. I see no reason that this will not continue to grow until and unless the bubble pops. Even then, a lot of those scrapers will probably stick around until their venture capital runs out.”
Regulation – we’ve heard of it
The Register asked Xe whether they thought broader deployment of Anubis and other active countermeasures would help.
They responded: “This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming. Ironically enough, most of these AI companies rely on the communities they are destroying.
“This presents the kind of paradox that I would expect to read in a Neal Stephenson book from the ’90s, not CBC’s front page. Anubis helps mitigate a lot of the badness by making attacks more computationally expensive. Anubis (even in configurations that omit proof of work) makes attackers have to retool their scraping to use headless browsers instead of blindly scraping HTML.”
And who is paying the piper?
“This increases the infrastructure costs of the AI companies propagating this abusive traffic. The hope is that this makes it fiscally unviable for AI companies to scrape by making them have to dedicate much more hardware to the problem. In essence: it makes the scrapers have to spend more money to do the same work.”
We approached Anthropic, Google, Meta, OpenAI, and Perplexity but none provided a comment on the report by the time of publication. ®
Updated to add:
Will Allen, VP of Product at Cloudflare, commented on the findings, saying Cloudflare’s observations were “reasonably close” to Fastly’s, “and the nominal difference could potentially be due to a difference in customer mix.” Allen added that, looking at its own AI bot and crawler traffic by crawl purpose for April 15 to July 14, Cloudflare could show that 82.7 percent is “for training — this is the equivalent of ‘AI crawler’ in Fastly’s report.”
Asked whether the growth in crawler traffic was likely to continue, Allen responded: “We don’t see any material slowdowns in the near term horizon – the desire for content currently seems insatiable.”
He opined: “All of our work around AI crawlers is anchored on a radically simple philosophy: content creators and website owners should get to decide how their content and data is used for commercial purposes when they put it online. Some of us want to write for the superintelligence. Others want a direct connection and to create for human eyes only.”
Asked how he suggested site operators reduce the burden of this traffic on their infrastructure, he naturally pitched the vendor’s own wares, saying “Cloudflare makes it incredibly easy to take control, even for our free users: you can decide to let everyone crawl you, or with one click block AI Crawlers from training and deploy our fully managed robots.txt.”
He said of the vendor’s AI labyrinth that it was “a first iteration of using generative AI to thwart bots for us, and generates valuable data that feeds into our bot detection systems. We don’t see this as a final solution, but rather a fun use of technology to trap misbehaving bots.”