For Data-Guzzling AI Companies, the Internet Is Too Small

Monday, April 1, 2024, 04:44 PM, from Slashdot

Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans. From a report: Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies. Some executives and researchers say the industry's need for high-quality text data could outstrip supply within two years, potentially slowing AI's development.

AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model, GPT-5, on transcriptions of public YouTube videos, people familiar with the matter said. Companies also are experimenting with using AI-generated, or synthetic, data as training material -- an approach many researchers say could actually cause crippling malfunctions. These efforts are often secret, because executives think solutions could be a competitive advantage.

Data is among several essential AI resources in short supply. The chips needed to run the large language models behind ChatGPT, Google's Gemini and other AI bots are also scarce. And industry leaders worry about a dearth of data centers and the electricity needed to power them. AI language models are built using text vacuumed up from the internet, including scientific research, news articles and Wikipedia entries. That material is broken into tokens -- words and parts of words that the models use to learn how to formulate humanlike expressions.
