Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft
Thursday, December 12, 2024, 08:35 AM, from Slashdot
Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to 'level the playing field' by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. 'It's gone through rigorous review,' he says. Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. 'I think about it a bit like the way that Linux has become a foundational operating system for so much of the world,' he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.
https://slashdot.org/story/24/12/12/0734228/harvard-is-releasing-a-massive-free-ai-training-dataset-...