Attack of the AI crawlers

Monday, May 5, 2025, 01:00 PM, from ComputerWorld
Enterprise IT leaders — and their counterparts in Legal and Compliance — have many reasons to hate having their websites visited by genAI model makers’ agents, whether they are called bots, crawlers, or spiders. 

They could object to their IP being stolen and used to train genAI models with almost no benefit to their company. They might be infuriated by copyright and trademark violations and the exposure of their customers’ and employees’ personally identifiable information to the world of thieves.

But most of all, they are being hit with massive bills from their web hosts for soaring bandwidth usage — even though many have used standard web mechanisms (robots.txt files, for starters) to tell the genAI crawlers, “Do not enter.” 
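
For context, a robots.txt file is a plain text file at the root of a site that asks crawlers, identified by their declared user agents, to stay away. A minimal “do not enter” file aimed at AI crawlers might look like the sketch below; the agent names are ones the major vendors have published for their declared crawlers, the list is illustrative rather than exhaustive, and the file is purely advisory, with nothing forcing a crawler to obey it.

    # robots.txt: ask declared AI crawlers to stay out, leave everyone else alone
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # An empty Disallow means no restriction for all other crawlers
    User-agent: *
    Disallow: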

[ Related: Inside the war between genAI and the internet ]

For various technical reasons, the law offers no meaningful remedy. There are a small number of vendors willing to sell companies software to halt the forbidden traffic — which itself is potentially problematic, as it could also halt search engine crawlers.

The key question here is “Why do the genAI model makers deploy bots that ignore the robots.txt files?” The answer is tricky — and the model makers are even trickier.

Oh, that? That isn’t my bot

Most of the major model makers contacted by Computerworld said that they respect the restrictions and that their crawlers do not go where they are not wanted.

One AWS executive, who asked that his name not be used, said that Amazon respects the rules and that “this is aligned with our responsible AI approach.”

Anthropic has a page dedicated to explaining why its behavior is always above-board and explicitly says “Anthropic’s Bots respect ‘do not crawl’ signals by honoring industry standard directives in robots.txt.”

But, industry observers argue, the trick is that the model makers are referencing only their officially named crawlers. Most also deploy — or have third parties deploy on their behalf — undeclared crawlers. And it is the undeclared crawlers that tend to go wherever they want and do whatever they want.

Reid Tatoris, senior director of product at Cloudflare, a vendor that dubs itself a connectivity cloud company, said the number of undeclared genAI crawlers is soaring.

“Our data shows that 30-40% of the AI crawling activity we see comes from undeclared crawlers that don’t announce their user agent,” Tatoris said. “We expect this number to grow over time as more websites block declared crawling and as the number of AI crawlers continues to explode.”
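
There is no direct way to count crawlers that refuse to identify themselves, but a site owner can at least measure the declared share in their own logs. The sketch below assumes an Apache or Nginx combined-format access log named access.log and a hand-maintained list of declared AI user agents; anything it cannot classify lands in the “everything else” bucket, which is exactly where Tatoris says the undeclared crawlers hide.

    import re
    from collections import Counter

    # User agents that well-known AI crawlers declare; illustrative, not exhaustive
    DECLARED_AI_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

    # In the combined log format, the user agent is the last quoted field on each line
    QUOTED = re.compile(r'"([^"]*)"')

    counts = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = QUOTED.findall(line)
            agent = fields[-1] if fields else ""
            if any(name in agent for name in DECLARED_AI_AGENTS):
                counts["declared AI crawler"] += 1
            elif agent in ("", "-"):
                counts["no user agent"] += 1
            else:
                counts["everything else"] += 1

    total = sum(counts.values()) or 1
    for label, n in counts.most_common():
        print(f"{label}: {n} requests ({100 * n / total:.1f}%)")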

This tactic allows genAI model makers to proudly declare that they are following the rules while they (directly or indirectly) let undeclared crawlers go renegade, rotating IP addresses and pretending to be something else.

Dennis Schubert is a business consultant and SEO specialist in Berlin who has been closely tracking the genAI crawler situation. He said he has observed many of the genAI crawlers violating the rules, but not all of them.

“I observed Microsoft — the Bing bot, specifically — but I don’t think I’ve ever seen IBM or DeepSeek,” Schubert said. “But I only ever looked at the ‘top user agent,’ so if those only do a few requests with large time spacing, I wouldn’t notice.”

Noah Susskind, general counsel at AI risk vendor StackAware, said that, as a lawyer, he has been impressed with the double standard that the model makers deploy when it comes to legal protections.

“GenAI vendors treat their terms of service as God’s own words, but they ignore robots.txt” on anyone else’s site, Susskind said. 

To be fair, it’s not entirely clear that robots.txt directives are legally enforceable, according to Susskind and other attorneys who focus on technology issues. If the model makers argued that they have the right to ignore those requests, that might be a defensible position. But that is not what they are arguing. They say they abide by those rules, and then many send out undeclared crawlers that ignore them anyway.

The real problem is that they are inflicting financial damage on site owners by forcing them to pay far more for bandwidth. And it is solely the model makers that benefit, not the site owners.

What is IT to do, Susskind asked, when an undeclared genAI crawler “hits my site a million times a day”? Indeed, Susskind’s team has seen “a single bot hitting a site millions of times per hour. That is several orders of magnitude more burdensome than normal SEO crawling.”

Cloudflare offers its customers a service that diverts these crawlers away from a site by feeding them legitimate but irrelevant content to keep them busy. The vendor’s different Application Services plans include varying levels of bot mitigation features — for example, the $200-per-month Business plan protects against sophisticated bots and offers basic bot analytics. (The company does not disclose pricing for its Enterprise plan, which offers more advanced bot analytics and protections.)

One problem the firm has encountered arises when sites want to allow search engine crawlers but block genAI crawlers, Tatoris said. That is easily accomplished in most cases, but “the Google bot is a tricky one, a challenging one right now,” because it’s difficult if not impossible to distinguish between the Google search engine crawler and the Google genAI crawler.
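
Google does document a separate robots.txt product token, Google-Extended, intended to let a site opt its content out of use for Google’s AI models while leaving ordinary Googlebot search crawling alone, as in the snippet below. Google-Extended is a usage control rather than a separate crawler, though, so blocking it does not by itself reduce the crawl traffic Tatoris is describing.

    # Keep normal Search crawling, opt out of use for Google's AI models
    User-agent: Googlebot
    Disallow:

    User-agent: Google-Extended
    Disallow: /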

Consultant Schubert said he doesn’t have a good answer on how to protect web assets from AI crawlers. “A lot of people do the ‘let’s use an LLM to generate trash content to feed trash to the training robots’ [tactic], and while I guess that works, I’m not a huge fan,” he said. “That’s effectively wasting energy to allow someone else to waste energy. Ideally, we’d have clear legislation and judge decisions telling those companies that what they do is not fair use.”

Little help from the law

In a vacuum, this situation would be ideal for a class-action lawsuit, because there are lots of victims and the damages are relatively easy to quantify: the web host could show a site’s typical bandwidth costs before the genAI crawler visits began and after.

The problem, according to attorneys in this space, is not with establishing monetary damages but with attribution: how to determine who’s responsible for the surging traffic.

In such a hypothetical court case, the lawyers for the deep-pocketed genAI model makers would likely argue that plaintiffs’ sites are visited by millions of users and bots from multiple sources. Without proof tying traffic to a specific crawler or tying a crawler to a specific model maker, the model maker can’t be held accountable for plaintiffs’ financial damages.

For many sites, web analytics simply cannot precisely quantify how much bandwidth is attributable to any one visitor. Some specialty services claim to be able to do that, but they come at additional cost.
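
For sites that do keep raw access logs, a first-pass attribution is at least sketchable without a specialty service. The snippet below, which again assumes a combined-format log named access.log, totals the response bytes served to each user agent; it says nothing about crawlers that spoof a browser user agent or rotate IP addresses, which is precisely the attribution gap the attorneys describe.

    import re
    from collections import Counter

    QUOTED = re.compile(r'"([^"]*)"')                      # request, referer, user agent
    STATUS_BYTES = re.compile(r'"\s+(\d{3})\s+(\d+|-)\s')  # status code, then response size

    bytes_by_agent = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            quoted = QUOTED.findall(line)
            agent = quoted[-1] if quoted else "(none)"
            m = STATUS_BYTES.search(line)
            if m and m.group(2) != "-":
                bytes_by_agent[agent] += int(m.group(2))

    # Ten heaviest user agents by bytes served
    for agent, total in bytes_by_agent.most_common(10):
        print(f"{total / 1e9:8.2f} GB  {agent}")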

Michael Isbitski, principal application security architect for genAI at ADP, sees the problem as terribly difficult for IT leaders to fix on their own. “Attribution is absolutely hard, especially when [undeclared crawlers] deploy obscured IP addresses,” he said.

Ian Poynter, a member of the board of advisors for Humma.AI and former CEO of Kalahari Security, has also been watching this genAI bot activity. IT departments typically “do not have logs that are detailed or correlated enough” to pinpoint crawler traffic, Poynter said. 

As for the legal challenges, he argued that the courts have yet to meaningfully address the issue. “Lawyers love precedents and courts love precedents. And the precedents haven’t yet been set,” Poynter said.

B. Stephanie Siegmann, a partner with the Boston law firm Hinckley Allen, agreed. “In the cyber arena, the laws haven’t kept up,” said Siegmann, who specializes in technology issues. 

One of the most problematic factors is that the model makers are overwhelmingly massive companies with gigantic legal war chests.

“I think these big companies are just daring someone to sue them. Somebody [in their legal department] must have said, ‘It’s fine. Ignore the robots.txt. That is for other people, not us,’” Poynter said.

Instead of calling them bots, crawlers, or spiders, Poynter suggested his own name: “A better term would be leeches. [The model makers] think that ‘If there are enough of us doing this, we can get away with it.’”

Correction: The last two paragraphs of this story have been updated to identify Ian Poynter as the speaker; they were originally attributed to a different source.
https://www.computerworld.com/article/3972835/attack-of-the-ai-crawlers.html
