US Lawmaker Proposes a Public Database of All AI Training Material

Thursday April 11, 2024. 11:25 PM , from Slashdot

An anonymous reader quotes a report from Ars Technica: Amid a flurry of lawsuits over AI models' training data, US Representative Adam Schiff (D-Calif.) has introduced (PDF) a bill that would require AI companies to disclose exactly which copyrighted works are included in datasets training AI systems. The Generative AI Disclosure Act 'would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system,' Schiff said in a press release.

The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it's enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use. All notices would be kept in a publicly available online database.

Currently, creators who don't have access to training datasets rely on AI models' outputs to figure out if their copyrighted works may have been included in training various AI systems. The New York Times, for example, prompted ChatGPT to spit out excerpts of its articles, relying on a tactic to identify training data by asking ChatGPT to produce lines from specific articles, which OpenAI has curiously described as 'hacking.' Under Schiff's law, The New York Times would need to consult the database to ID all articles used to train ChatGPT or any other AI system. Any AI maker who violates the act would risk a 'civil penalty in an amount not less than $5,000,' the proposed bill said. Schiff described the act as championing 'innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets.'

'This is about respecting creativity in the age of AI and marrying technological progress with fairness,' Schiff said.

Read more of this story at Slashdot.