
GenAI crawler problem highlights a bigger issue: the cloud bandwidth nightmare

Wednesday, May 14, 2025, 12:00 PM, from ComputerWorld
It comes as no surprise that online scraper bots are scouring the web for content to train generative AI (genAI) models, running up massive bandwidth charges for enterprises that told them not to crawl their sites. What is surprising is the lengths to which large language model (LLM) makers go, using unattributed crawlers and other means, to escape responsibility.

As bad as that situation is, something worse lies below it. And it’s an inequity that’s been around for decades, starting in the earliest days of the web.

The problem is simple, but the answer is not: enterprises have had business reasons to ignore the issue for years. 

Since the earliest days of the web, hosting companies have charged enterprises for bandwidth based on usage. That seems fair enough. The problem is that those enterprises have limited control over how much bandwidth gets used, and their budgets are finite.

In other words, corporate budgets for bandwidth are based on typical activity. But then someone says something on social media, the post goes viral, readers flock to a site in massive numbers, and bandwidth costs soar. Are enterprises really on the hook for an infinite amount of money?

Here’s where things get complicated. Businesses of all types tolerated this situation with the expectation that a big jump in traffic would generate a big jump in revenue. So they didn’t object to rising bandwidth costs.

Then came search engine spiders. (Note: spiders and crawlers are interchangeable terms; both are bots.) Sure, they ate up bandwidth, but again, the assumption was that search traffic would be beneficial: it brought in customers and new prospects.

For the most part, search spiders respected robots.txt instructions about which parts of a site they could crawl. Because search providers knew most sites welcomed their visits, they more or less honored those restrictions.
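For a sense of how those instructions work, here is a minimal sketch, using Python's standard urllib.robotparser, of the check a well-behaved crawler is supposed to perform before fetching a page. The robots.txt directives and the "ExampleAIBot" user agent are hypothetical, not taken from any real site or vendor.

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt: tell one (made-up) AI crawler to stay out
    # entirely, and keep everyone else away from /private/.
    ROBOTS_TXT = """
    User-agent: ExampleAIBot
    Disallow: /

    User-agent: *
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # A well-behaved crawler runs this check before every fetch.
    print(parser.can_fetch("ExampleAIBot", "https://example.com/article.html"))   # False
    print(parser.can_fetch("SomeSearchBot", "https://example.com/article.html"))  # True
    print(parser.can_fetch("SomeSearchBot", "https://example.com/private/x"))     # False

Nothing enforces the result of that check; compliance is entirely voluntary, which is exactly the gap the rest of this piece is about.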

That brings us to today, when the companies behind LLMs, through various sneaky mechanisms, do not respect those “do not enter” signs. And their crawlers don’t deliver the perceived value of human visitors or even search engine spiders. Instead of bringing new prospects to an enterprise site, they steal data, use it for their own apps and then sell it to others.

Website owners get no meaningful benefit, only higher costs from increased bandwidth use. Most of the major model makers deny they do this, but that’s because they are using undeclared crawlers to do most of their dirty work. And, as we detailed recently, those crawls are done in a way cleverly designed to avoid legal consequences.

There have been some efforts to address the problem. Cloudflare offers a popular one that essentially creates an attractive honeypot to keep unauthorized crawlers at bay.
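Cloudflare's actual product works differently and at far larger scale; the sketch below only illustrates the general bot-trap idea behind such defenses: publish a link that robots.txt forbids and that no human would ever click, then flag any client that requests it. The /trap-do-not-crawl path, the port, and the in-memory flag set are all hypothetical simplifications.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Clients that have fetched the trap URL. In practice this would live in
    # a shared store and feed a firewall or rate limiter, not a Python set.
    flagged_ips = set()

    TRAP_PATH = "/trap-do-not-crawl"  # also listed as Disallow in robots.txt

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            client_ip = self.client_address[0]

            if self.path == TRAP_PATH:
                # Only a crawler ignoring robots.txt (or following hidden
                # links) should ever land here, so remember it.
                flagged_ips.add(client_ip)
                self.send_response(403)
                self.end_headers()
                return

            if client_ip in flagged_ips:
                # Previously flagged clients get nothing useful.
                self.send_response(403)
                self.end_headers()
                return

            # Normal response; the trap link is invisible to humans.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(
                b'<html><body>Hello'
                b'<a href="/trap-do-not-crawl" style="display:none"></a>'
                b'</body></html>'
            )

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), TrapHandler).serve_forever()

The catch, of course, is that blocking a flagged crawler still happens after the bandwidth has already been spent serving it.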

But, to reiterate, the real problem is that companies generally have agreed to pay an infinite amount of money for bandwidth they can’t control. So, it’s hard now to fight something you’ve knowingly tolerated for decades.

If unauthorized crawlers were forced to pay those costs, the situation would likely resolve itself quickly. Or perhaps cloud vendors could charge for the bandwidth. Conveniently, many of the large cloud companies — think Amazon, Google and Microsoft — also happen to own the operations sending out cowboy crawlers. Isn’t that special? 

More importantly, doesn’t that create a massive conflict of interest?

The problem will be difficult to fix. Most of the obvious mechanisms are untenable. A site could, for example, declare that it is willing to spend X dollars on bandwidth and no more. But what happens when the site hits that number? Would Walmart or Chase Bank really say, “Turn off the bandwidth hose until next month”?

Of course not.

That brings us to an attribution problem. An enterprise knows that its bandwidth numbers are soaring by a certain percentage above normal. But during that time, it was visited by millions of humans and an even greater number of bots from all kinds of companies, including search and genAI crawlers.

Most sites’ analytics struggle to attribute specific bandwidth increases to specific visitors. And even for those that can, the biggest violators are undeclared bots, or bots that can’t easily be tied to a specific company. Sometimes the bots come from countries such as China, Russia and North Korea that rarely play nice with US laws.
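A first pass at that attribution can come from the web server's own access logs rather than page analytics. The sketch below assumes logs in the common "combined" format and a hypothetical access.log filename; it sums response bytes per declared user agent, and it obviously cannot unmask bots that lie about or omit who they are.

    import re
    from collections import Counter

    # Combined log format: ... "METHOD /path HTTP/1.x" status bytes "referer" "user-agent"
    LINE_RE = re.compile(r'"[A-Z]+ \S+ \S+" \d{3} (\d+|-) "[^"]*" "([^"]*)"')

    bytes_by_agent = Counter()

    with open("access.log") as log:   # hypothetical log file
        for line in log:
            match = LINE_RE.search(line)
            if not match:
                continue
            size, agent = match.groups()
            if size != "-":
                bytes_by_agent[agent] += int(size)

    # Largest bandwidth consumers by declared user agent.
    for agent, total in bytes_by_agent.most_common(10):
        print(f"{total / 1_048_576:10.1f} MB  {agent}")

Even a rough tally like this at least shows how much of a bandwidth spike came from self-identified crawlers versus everything else.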

Still, enterprise IT needs to have some serious conversations with hosting vendors — or with business partners that handle those arrangements — about getting unauthorized bandwidth charges under control. Given various reports that bots represent more traffic than humans today, that conversation needs to happen soon.
https://www.computerworld.com/article/3983386/genai-crawler-problem-highlights-a-bigger-issue-the-cl...
