Inside the war between genAI and the internet

Friday, March 28, 2025, 11:00 AM, from ComputerWorld
Generative AI (genAI) companies are starting to do real damage to the internet. 

One of the internet’s main purposes is to serve as a global network for free and open communication and information exchange among scientists, academics, and the public, and to be an uncensorable venue for free speech.

(One of the most dangerous threats to the internet is recent bipartisan support for repealing Section 230 of the Communications Decency Act, which, if actually repealed, would seriously harm free speech online. That’s an issue you can read about on the EFF website.)

The purest expression of the internet’s purpose is the world of Open Access (OA) websites. These are sites that provide free and unrestricted access to scholarly information such as research articles, books, data, and educational resources. Open Access allows users to get content without technical barriers. It provides legal permissions for reading, downloading, copying, distributing, and reusing content with proper attribution. And it’s part of the broader Open Science movement. 

But now, OA sites are under attack. AI bots, or AI crawlers, which constantly scan the web for data to add to the training sets of genAI chatbots and related services, are overwhelming OA websites and others, straining resources and causing outages. 

Of course, there are many different kinds of bots, which collectively generate more traffic on the internet than humans. DesignRush says that bots now account for 80% of all web visits. 

Bot types include search engine bots, SEO and analytics bots, social media bots, malicious bots, and web scraping bots. 

But AI crawlers are by far the fastest-growing kind of bot. According to DesignRush, the crawlers from one company — OpenAI’s GPT bots — now account for about 13% of all web traffic and make hundreds of millions of requests per month.

Their mission is to take data and essentially replace the original source. For example, instead of using Google to find scientific articles on a subject, the AI crawlers seek to take those articles and present a new “article” for the user cobbled together from many articles and many sites, incentivizing the user to ignore the source sites and get their information from the chatbots.

To oversimplify the problem, harvesting more data from OA sites makes chatbots faster and more convenient to use. However, the harvesting itself makes the OA sites slower and harder to use. 

While much digital ink has been spilled decrying the taking of content, it’s also important to know that the chatbot companies are overwhelming many of the sites they’re copying content from, much like a daily DDoS attack. 

Different kinds of bots affect different types of websites in different ways, but they can have a huge impact on OA sites. 

Fighting back

Cloudflare is now deliberately poisoning large language model (LLM) training data, fighting back against the AI companies that are taking data from websites without permission. (The company offers content delivery networks, cybersecurity, DDoS mitigation, and web performance optimization.)

Here’s the problem Cloudflare is trying to solve: Companies like OpenAI, Anthropic, and Perplexity have been accused of harvesting data from websites, ignoring robots.txt files on the sites (originally designed to tell search engines which files were off-limits for indexing), and taking data anyway. In addition to these big names, all kinds of smaller, less legitimate companies are capturing data without permission from the rightful owners. 

Cloudflare’s solution is a feature available to all customers called “AI Labyrinth.” The program redirects incoming bots to its own special-purpose websites, which are filled with huge quantities of factually accurate AI-generated information that is irrelevant to the target website. 
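
To make the mechanism concrete, here is a toy illustration of the general decoy-maze idea, written in Python. It is not Cloudflare’s implementation; the decoy_page and handle_request functions are invented for the example, and in a real deployment the decoy text would come from pre-generated AI content and the bot check from a bot-management layer.

    import hashlib

    # Toy sketch of the decoy-maze idea behind features like AI Labyrinth.
    # NOT Cloudflare's implementation; names and logic are illustrative only.

    def decoy_page(path: str, links_per_page: int = 5) -> str:
        """Return an HTML decoy page whose links lead only to more decoy pages."""
        seed = hashlib.sha256(path.encode()).hexdigest()
        links = "".join(
            f'<a href="/maze/{seed[i:i + 8]}">related reading</a> '
            for i in range(0, links_per_page * 8, 8)
        )
        return f"<html><body><p>Filler article for {path}.</p>{links}</body></html>"

    def handle_request(path: str, suspected_bot: bool) -> str:
        """Route suspected crawlers into the maze; serve real content to everyone else."""
        return decoy_page(path) if suspected_bot else f"<real content for {path}>"

    if __name__ == "__main__":
        print(handle_request("/maze/start", suspected_bot=True)[:120])

A crawler that follows the links never escapes the maze, which is what makes the approach both a time sink and, as noted below, a honeypot.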

In addition to wasting the time of the companies in control of the bots, AI Labyrinth is also a honeypot, enabling Cloudflare to add those companies to a blacklist. 

The idea is somewhat similar to the “Nightshade” project from the University of Chicago, which was designed to protect artists’ work by poisoning image data. The project let digital artists download Nightshade for free and subtly alter the pixels of their artwork so that people still see the same image, but AI models completely misread what the pictures depict.

One way to stop AI crawlers is via good old-fashioned robots.txt files, but as noted, crawlers can and often do ignore those. That has prompted many to call for penalties, such as infringement lawsuits, for doing so. 
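
For reference, a minimal robots.txt that asks the best-known AI crawlers to stay away might look like the sketch below. The user-agent names are examples drawn from the crawlers’ published documentation and can change, so site owners should verify the current strings for each vendor.

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Everyone else may crawl as usual
    User-agent: *
    Disallow: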

Another approach is to use a Web Application Firewall (WAF), which can block unwanted traffic, including AI crawlers, while allowing legitimate users to access a site. By configuring the WAF to recognize and block specific AI bot signatures, websites can theoretically protect their content. More advanced AI crawlers might evade detection by mimicking legitimate traffic or using rotating IP addresses. Protecting against this is time-consuming, forcing the frequent updating of rules and IP reputation lists — another burden for the source sites. 
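
As a rough illustration of the signature-matching idea (a generic sketch, not any particular vendor’s WAF rules), the Python snippet below rejects requests whose User-Agent header matches a short, assumed list of AI-crawler signatures.

    import re

    # Illustrative signatures only; real WAF rule sets are maintained and updated by vendors.
    AI_CRAWLER_SIGNATURES = [
        re.compile(r"GPTBot", re.IGNORECASE),
        re.compile(r"ClaudeBot|anthropic-ai", re.IGNORECASE),
        re.compile(r"CCBot", re.IGNORECASE),
        re.compile(r"PerplexityBot", re.IGNORECASE),
    ]

    def is_blocked(user_agent: str) -> bool:
        """Return True if the User-Agent matches a known AI-crawler signature."""
        return any(sig.search(user_agent or "") for sig in AI_CRAWLER_SIGNATURES)

    if __name__ == "__main__":
        # Decide whether to serve the request or return HTTP 403.
        for ua in ("Mozilla/5.0 (compatible; GPTBot/1.1)", "Mozilla/5.0 (X11; Linux x86_64)"):
            print(ua, "->", "block (403)" if is_blocked(ua) else "allow")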

Rate limiting is also used to prevent excessive data retrieval by AI bots. This involves setting limits on the number of requests a single IP can make within a certain timeframe, which helps reduce server load and data misuse risks.
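
A minimal sliding-window limiter, sketched below in Python, captures the basic mechanism; the 60-requests-per-minute budget is an arbitrary assumption, and real deployments typically track counters in a shared store such as Redis rather than in process memory.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60            # length of the sliding window (assumed)
    MAX_REQUESTS_PER_WINDOW = 60   # arbitrary per-IP budget (assumed)

    _request_log = defaultdict(deque)

    def allow_request(ip, now=None):
        """Allow at most MAX_REQUESTS_PER_WINDOW requests per IP per window."""
        now = time.time() if now is None else now
        log = _request_log[ip]
        # Drop timestamps that have fallen out of the window.
        while log and now - log[0] > WINDOW_SECONDS:
            log.popleft()
        if len(log) >= MAX_REQUESTS_PER_WINDOW:
            return False   # over budget: respond with HTTP 429
        log.append(now)
        return True

    if __name__ == "__main__":
        allowed = sum(allow_request("203.0.113.5", now=0.0) for _ in range(100))
        print(f"allowed {allowed} of 100 burst requests")   # allowed 60 of 100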

Advanced bot management solutions are becoming more popular, too. These tools use machine learning and behavioral analysis to identify and block unwanted AI bots, offering more comprehensive protection than traditional methods.
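
Behavioral analysis looks at how a client acts rather than what it claims to be. The toy heuristic below (illustrative thresholds only, not any vendor’s model) flags clients that request many distinct pages at a near-constant pace, a pattern more typical of crawlers than of human readers.

    from statistics import pstdev

    def looks_like_crawler(timestamps, paths,
                           min_requests=30, max_jitter=0.2, min_path_ratio=0.9):
        """Toy behavioral heuristic with assumed thresholds.

        Flags a client that (a) makes many requests, (b) spaces them with very
        little timing jitter, and (c) rarely revisits the same path.
        """
        if len(timestamps) < min_requests:
            return False
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        mean_gap = sum(gaps) / len(gaps)
        jitter = pstdev(gaps) / mean_gap if mean_gap else 0.0
        unique_path_ratio = len(set(paths)) / len(paths)
        return jitter < max_jitter and unique_path_ratio > min_path_ratio

    if __name__ == "__main__":
        ts = [i * 0.5 for i in range(40)]                # metronome-steady requests
        pages = [f"/articles/{i}" for i in range(40)]    # never repeats a page
        print(looks_like_crawler(ts, pages))             # True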

Lastly, advocacy efforts and policy changes are underway to give content creators more control over how their work is used.

In the meantime, something needs to be done about the impact of AI crawlers on OA websites, which offer some of the best sources of information on the internet both to people and to LLM-based chatbots. 

While the legality and acceptability of simply taking content are argued online, in the courts, and in government, we can’t let those same companies essentially sabotage, attack, and crush the very sites they’re taking from while the debate rages on.
https://www.computerworld.com/article/3854642/inside-the-war-between-genai-and-the-internet.html
