AI Controversy: OpenAI Accused By O’Reilly of Training AI on Its Paywalled Content
Monday, April 7, 2025, 09:16 PM, from eWeek
Meta recently faced accusations of training its AI models on pirated content; now, OpenAI finds itself entangled in a similar controversy. A new study claims that one of OpenAI’s latest large language models (LLMs) was trained on non-public, copyright-protected material from O’Reilly Media. Specifically, the study’s authors suggest that OpenAI’s development teams may have trained one of their most advanced models on restricted content without authorization.
The study’s authors wrote, in part: “Although the evidence present here on model access violations is specific to OpenAI and O’Reilly Media books, this is likely a systematic issue.”
Examining the accusations
The study was written by a team with O’Reilly Media, including CEO Tim O’Reilly. It explicitly claims that OpenAI, one of today’s top AI companies, trained one of its most recent AI models on content that O’Reilly Media keeps behind a paywall on its official channels.
The authors of the study, titled “Beyond Public Access in LLM Pre-Training Data,” started with 34 copyrighted books from O’Reilly Media, including both publicly available and paywalled content. Next, they applied the DE-COP membership inference attack, a method for determining whether an AI model has already memorized a specific text, to several AI models from OpenAI. The team also assigned an Area Under the Receiver Operating Characteristic (AUROC) score to each LLM. The score reflects how likely it is that a model was trained on one or more of the 34 copyrighted books; a higher score indicates stronger recognition of the content.
GPT-4o: Demonstrates stronger recognition of non-public content from O’Reilly Media (AUROC score: 82%) than of public content (AUROC score: 64%).
GPT-3.5 Turbo: Demonstrates slightly stronger recognition of public content from O’Reilly Media (AUROC score: 64%) than of non-public content (AUROC score: 54%).
GPT-4o Mini: Shows no indication that the model was trained on public or non-public content from O’Reilly Media.
Reading the fine print
While the study initially absolves GPT-4o Mini of any infringement, it notes that this result could stem from the model’s smaller scale and its inability to remember as much text as GPT-4o and other generative AI tools. The study also expresses some uncertainty about the AUROC scores, noting that they are meant to be taken as estimates.
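For readers unfamiliar with AUROC, the following is a minimal Python sketch of how such a score can be computed in general. It is not the study's code: the function name `auroc`, the two groups, and the example scores are all hypothetical and chosen only for illustration. The sketch assumes each text has a numeric "recognition" score and that texts are split into those suspected to be in the training data and those assumed not to be. The AUROC is then the fraction of cross-group pairs in which the suspected text scores higher, so 0.5 corresponds to chance and 1.0 to perfect separation.

```python
def auroc(suspected_scores, control_scores):
    """Return the AUROC for two groups of scores.

    Computed as the probability that a randomly chosen score from
    suspected_scores exceeds a randomly chosen score from
    control_scores, with ties counted as half a win.
    """
    if not suspected_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")

    wins = 0.0
    for s in suspected_scores:
        for c in control_scores:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(suspected_scores) * len(control_scores))


# Hypothetical recognition scores, invented for this example only.
suspected = [0.9, 0.8, 0.6, 0.4]   # texts suspected to be in training data
control = [0.7, 0.5, 0.3, 0.2]     # texts assumed not to be in training data

print(f"AUROC: {auroc(suspected, control):.0%}")
```

With these made-up numbers, 13 of the 16 cross-group pairs favor the suspected texts, which gives an AUROC of about 81%.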
The study concludes by suggesting that current AI training methods may soon lead to an “extractive dead end.” It argues that by failing to compensate copyright owners and content creators, AI developers will ultimately see diminished content quality, accuracy, and diversity.
https://www.eweek.com/news/openai-o-reilly-copyright/