Cloudflare offers simpler way to stop AI bots

Thursday, July 4, 2024, 10:41 PM, from ComputerWorld

Content distribution network Cloudflare is making it simpler for customers who have had enough of badly behaved bots to block them from their website.

It’s long been possible to prevent well-behaved bots from crawling your corporate website by adding a “robots.txt” file listing who’s welcome and who isn’t — and content distribution networks such as Cloudflare offer visual interfaces to simplify the creation of such files.
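For readers who have not seen one, a robots.txt file is just a short list of rules keyed by crawler name. The sketch below is illustrative only: the rules are a hypothetical example rather than anything Cloudflare generates, `example.com` is a placeholder, and the check uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would interpret them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away two AI crawlers, welcome everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks this question before requesting a page.
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/article")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Nothing in the file is enforced by the web server. It works only because the crawler chooses to run a check like this one and respect the answer.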

But faced with the arrival of a new generation of badly behaved AI bots, scraping content to feed their large language models (LLMs), Cloudflare has introduced an even quicker way to block all such bots with one click.

“The popularity of generative AI has made the demand for content used to train models or run inference on skyrocket, and although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent,” Cloudflare staff wrote in a blog post.

According to the authors of the post, “Google reportedly paid $60 million a year to license Reddit’s user generated content, Scarlett Johansson alleged OpenAI used her voice for their new personal assistant without her consent, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.”

Last year, Cloudflare introduced a way for any of its customers, on any plan, to block specific categories of bots, including certain AI crawlers. These bots, Cloudflare said, honor the directives in sites’ robots.txt files and do not use unlicensed content to train their models or gather data to feed retrieval-augmented generation (RAG) applications.

To do this, Cloudflare identifies bots by their “user-agent string,” a kind of calling card presented by browsers, bots and other tools requesting data from a web server.

“Even though these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them. We hear clearly that customers do not want AI bots visiting their websites, and especially those that do so dishonestly,” the post said.

The top four AI web crawlers visiting sites protected by Cloudflare were Bytespider, Amazonbot, ClaudeBot and GPTBot, it said. Bytespider, the most frequent visitor, is operated by ByteDance, the Chinese company that owns TikTok. It visited 40.4% of protected websites, and is reportedly used to gather training data for ByteDance’s LLMs, including those that support its ChatGPT rival Doubao. Amazonbot is reportedly used to index content to help Amazon’s Alexa chatbot answer questions, while ClaudeBot gathers data for Anthropic’s AI assistant Claude.

Blocking bad bots

Blocking bots based on their user-agent string will only work if such bots tell the truth about their identity — but there are signs that not all do, or not all the time.
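A minimal sketch makes the weakness concrete. It assumes a simple substring match against the four crawler names reported in Cloudflare's post; the function name and both sample user-agent strings are illustrative, and this is not a description of how Cloudflare's own detection works.

```python
# Crawler names taken from the article; the matching logic itself is an
# assumption made for illustration.
AI_BOT_TOKENS = ("bytespider", "amazonbot", "claudebot", "gptbot")


def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header names one of the listed crawlers."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_BOT_TOKENS)


# Illustrative header values, not captured traffic.
honest = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

print(is_declared_ai_bot(honest))   # the bot announces itself, so it is caught
print(is_declared_ai_bot(spoofed))  # a scraper posing as a browser slips through
```

The second request carries no crawler name at all, so a filter that trusts the header alone has nothing to match against.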

In such cases, other measures will be necessary, and enterprises’ main recourse against unwanted web scraping is normally reactive: pursuing legal action, according to Thomas Randall, director of AI market research at Info-Tech Research Group.

“While some software applications exist for web scraping prevention (such as DataDome and Cloudflare), these can only go so far: if an AI bot is rarely scraping a site, the bot may still go undetected,” he said via email.

To justify legal action against the operators of bad bots, enterprises will need to do more than claim that the bot didn’t leave when asked.

The best course of action, Randall said, is for “enterprises to hide intellectual property or other important information behind a membership paywall. Any scraping done behind the paywall is liable for legal action, reinforced with a clear restrictive copyright license on the site. The organization must, therefore, be prepared to legally follow through. Any scraping done on the public site is accepted as part of the organization’s risk tolerance.”

Randall noted that organizations with the resources to go further could consider rate-limiting connections to their site; automatically and temporarily blocking suspicious IP addresses; limiting the explanation of why access has been blocked to a message such as “For help, contact support via helpdesk@company.com,” in order to force a human interaction; and double-checking how much of their website content is also available through their mobile site and apps.
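The first three of those measures can be combined in a few lines. The following is a rough sketch rather than a production design: the class name, the thresholds and the in-memory bookkeeping are all assumptions chosen for illustration, and a real deployment would more likely rely on a CDN, reverse proxy or firewall feature.

```python
import time
from collections import defaultdict, deque

# Deliberately uninformative message, as suggested in the article.
BLOCK_MESSAGE = "For help, contact support via helpdesk@company.com"


class RateLimiter:
    """Sliding-window rate limiter with temporary per-IP blocks (hypothetical)."""

    def __init__(self, max_requests=5, window_seconds=10.0, block_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._blocked_until = {}          # ip -> time at which the block expires

    def allow(self, ip, now=None):
        """Return True if the request may proceed, False if it is refused."""
        now = time.monotonic() if now is None else now
        if self._blocked_until.get(ip, 0.0) > now:
            return False                  # still serving a temporary block
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()                # forget requests outside the window
        if len(hits) >= self.max_requests:
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False                  # too many requests: start a block
        hits.append(now)
        return True


limiter = RateLimiter(max_requests=3, window_seconds=10.0, block_seconds=60.0)
for t in (0.0, 1.0, 2.0, 3.0):
    ok = limiter.allow("203.0.113.7", now=t)
    print(t, "ok" if ok else BLOCK_MESSAGE)
```

As Randall's earlier caveat implies, a limiter like this catches only aggressive clients. A bot that requests pages slowly, or spreads its requests across many IP addresses, stays under the threshold.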

“Ultimately, scraping cannot be stopped, but hindered at best,” he said.
