OpenAI Strikes Publisher Deals to Reduce AI Web Scraping Tensions

Thursday, October 31, 2024, 05:23 PM, from eWeek

As the artificial intelligence company behind ChatGPT, OpenAI has a seemingly never-ending hunger for data on which to train its popular model, much of which it gets by scraping the web. Now the company has begun negotiating licensing deals with media producers to harvest their online content as a way to skirt the moral, ethical, and legal questions surrounding the process of web scraping.

These negotiations are taking place in an evolving Internet landscape in which the value of media content has increased dramatically, with articles, images, and video making up much of the must-have fuel for the generative AI tech boom. Disagreements about how AI developers access media content are subject to a titanic push-pull involving thorny legal issues, complex technology, and large sums of money. Interested parties on both sides of the issue are watching closely to see whether the company’s new approach will prove successful, and whether it might be worth replicating.

KEY TAKEAWAYS

• OpenAI and other generative AI developers have built their models by aggressively scraping the web in a legally ambiguous manner.
• Having prompted numerous lawsuits with unauthorized web scraping, OpenAI is now transitioning to licensing negotiations with producers.
• Issues about AI web crawling remain unresolved, and the outcome of this issue has far-reaching implications for the future of the media industry.

TABLE OF CONTENTS
• Is AI Web Scraping Legal?
• How Does AI Web Scraping Technology Work?
• How is the Business of AI Web Scraping Changing?
• Are Content Producers Negotiating Their Own Demise?
• Bottom Line: AI Content Licensing Has Far Reaching Implications

Is AI Web Scraping Legal?

OpenAI debuted ChatGPT in November 2022 to much publicity and attention. As a raft of competing apps followed, many in the media began to ask an obvious question: Where did these AI developers obtain the ocean of data needed to feed and train their models? The answer, of course, is that generative AI companies are aggressive—some might say reckless—web scrapers. Their hungry bots travel the Internet day and night, pulling information.

Having invested heavily in their content, content producers—including writers, artists, bloggers, musicians, and many media outlets—feel a deep sense of ownership about that content. Amid questions about copyright, a tornado of lawsuits is now pending, including a few high profile cases:

Getty Images vs. Stability AI: Photographer collective and stock image repository Getty Images alleges that Stability AI infringed on more than 12 million photographs, including their captions and metadata.

New York Times vs. Microsoft: The New York Times alleges that millions of pieces of its content were used to build the large language models of Microsoft’s Copilot and OpenAI’s ChatGPT. Microsoft is a major investor in OpenAI and is entitled to a share of the profits from the for-profit division of OpenAI.

Concord Music Group, Inc vs. Anthropic PBC: Several major music publishers allege that Anthropic used lyrics to train the Claude LLM, and that Anthropic removed CMI (copyright management information) from this material.

In the wake of legal action, OpenAI has signed deals with approximately a dozen publishers, including Vox; The Atlantic; Dotdash Meredith, which publishes numerous tech, finance, and health publications; and Condé Nast, which publishes Wired, The New Yorker, and Vanity Fair. These deals appear to be an acknowledgement by OpenAI that it’s time to change its AI web crawling process.

After all, building generative AI apps offers stunning potential revenue: OpenAI is now valued at $157 billion. With that much money at stake, it’s no surprise the company decided that signed contracts are a better strategy for long-term growth.

How Does AI Web Scraping Technology Work?

Publishers are increasingly using blocking technology to prevent AI web scrapers from accessing their content without permission. The tool that sites favor for this blocking action is the Robots Exclusion Protocol, or robots.txt, which can be set to exclude any AI web crawler by name—including OpenAI’s GPTBot.
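To make the blocking mechanism concrete, here is a minimal sketch in Python that uses the standard library's robots.txt parser. The robots.txt content, the example.com URL, and the crawler name SomeOtherBot are hypothetical and chosen only for illustration; GPTBot is the crawler name cited above. Note that robots.txt is a request that well-behaved crawlers honor, not an access control, so a rule like this works only if the crawler chooses to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example news site: GPTBot is excluded
# from the whole site, while every other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on the example site.
ARTICLE_URL = "https://example.com/news/story.html"

gptbot_allowed = can_crawl(ROBOTS_TXT, "GPTBot", ARTICLE_URL)
other_allowed = can_crawl(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

In this sketch the first group of rules applies only to GPTBot, and the wildcard group covers every crawler that is not named explicitly.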

Reuters Institute for the Study of Journalism reported that, by the end of 2023, 48 percent of high-profile sites were blocking GPTBot, while 24 percent were blocking Google’s AI web crawler. News sites were much more likely to block GPTBot than other popular sites, with 79 percent in the U.S. compared to just 20 percent in foreign markets. Legacy print news publications were more likely to block GPTBot than were born-on-the-web news outlets.

Data from Canadian AI detection company Originality AI suggests that OpenAI’s success in dealmaking is lowering the level of blocking. In 2024 the percentage of top national news sites that block GPTBot has fallen from a high of nearly 90 percent to just above 50 percent.

Hordes of AI bots continue to crawl the web, and AI developers that need data relentlessly harvest media with or without permission. Some website owners say there are so many AI bots crawling their sites that it’s having the same effect as a DDoS attack. This constant crowd of additional non-human traffic increases a site’s hosting costs and may even slow load times.

How is the Business of AI Web Scraping Changing?

As a sign of the potential revenue in the new market for content, web networking and security firm Cloudflare recently announced plans to launch a market that enables online content producers to sell access to AI web crawlers. To support this market, the company debuted a free tool called AI Audit, which allows websites to control how their pages are consumed by AI model developers.

The tool will include a feature that enables content producers to set their own price for AI vendors and features a metrics dashboard that shows site owners how often certain pages are crawled, which could support their negotiating efforts.
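A per-page crawl tally of the kind such a dashboard might report can be approximated from ordinary web server logs. The sketch below is not Cloudflare's tool or method; it assumes access log lines in the widely used "combined" format, and the sample log lines, the IP addresses, and the short list of crawler name fragments are hypothetical (only GPTBot is named in this article).

```python
import re
from collections import Counter

# Illustrative crawler name fragments to look for in the User-Agent field.
# Only GPTBot is named in the article; the list is an assumption.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot")

# Captures the request path and the final quoted field (the user agent)
# of a combined-format access log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" .* "(?P<agent>[^"]*)"$'
)


def count_ai_crawls(log_lines):
    """Count hits per (crawler, path) for requests made by listed AI crawlers."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                counts[(token, match.group("path"))] += 1
                break
    return counts


# Hypothetical log lines: three GPTBot hits, one CCBot hit, one human visitor.
SAMPLE_LOG = [
    '203.0.113.5 - - [31/Oct/2024:17:23:01 +0000] "GET /news/a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [31/Oct/2024:17:23:02 +0000] "GET /news/b.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [31/Oct/2024:17:23:09 +0000] "GET /news/a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [31/Oct/2024:17:24:00 +0000] "GET /news/a.html HTTP/1.1" 200 5120 "-" "CCBot/2.0"',
    '192.0.2.44 - - [31/Oct/2024:17:25:10 +0000] "GET /news/a.html HTTP/1.1" 200 5120 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

crawl_counts = count_ai_crawls(SAMPLE_LOG)
for (crawler, path), hits in sorted(crawl_counts.items()):
    print(f"{crawler:8} {path:16} {hits}")
```

Matching on the User-Agent string only identifies crawlers that announce themselves honestly, so a tally like this is a lower bound on automated traffic.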

While Cloudflare has gotten out front with its marketplace plans, it’s likely that there will be other efforts to standardize the licensing of content to AI web crawlers. Small niche sites with deep content troves will find this to be particularly useful, since they lack on-staff finance professionals to negotiate licensing terms.

Are Content Producers Negotiating Their Own Demise?

That licensing negotiations are occurring simultaneously with pending lawsuits suggests that the process has multiple moving parts, and no lack of confusion. It’s obvious what’s in it for AI vendors—access to a rich vein of content—but content producers have not publicly disclosed how they benefit from licensing content to AI vendors.

In a companywide email discussing the deal, Condé Nast CEO Roger Lynch referred to “ongoing turmoil within the publishing industry” and said that changes to Google search have created new revenue challenges for publishers.

“Our partnership with OpenAI begins to make up for some of that revenue, allowing us to continue to protect and invest in our journalism and creative endeavors,” Lynch wrote.

Presumably this means there will be some form of profit-sharing from licensing deals. These deals, however, create a direct feed of the most current, highest quality content from premier media organizations into generative AI models. The result is that these AI tools will grow correspondingly more sophisticated, timely, and even stylish in their output.

At the moment, the balance of power is shifting toward content producers, with lawsuits that may force AI vendors to pay huge sums. However, future iterations of AI bots will mature to an astounding functionality, given that LLMs mature on an exponential growth curve that leaps forward in ever shorter time spans, fueled by the latest processor chips. This means that, ironically, generative AI platforms will compete more aggressively with content producers, thanks to the very content those producers are now licensing to them.

Bottom Line: AI Content Licensing Has Far Reaching Implications

The many issues involving AI web scraping are in a period of rapid flux, with legal, licensing, and technical issues all evolving at the same time, and with each playing an important role in shaping the media and AI relationship. The key driving force in this issue: scarcity of quality data. Industry insiders like to say that generative AI developers are “running out of data to crawl,” which means that all the public, freely available content has already been crawled, and quite likely some copyrighted content too. This scarcity of data raises the stakes for generative AI developers, a cohort that has the funding to license data from content producers at all levels. While the ultimate result of this content licensing remains uncertain, it’s clear that the outcome of this licensing process will have an enormous impact on the future of media and all forms of information flow in the years ahead.
