New 'Open Source AI Definition' Criticized for Not Opening Training Data
Sunday November 3, 2024, 06:34 PM, from Slashdot
This move follows some discussion on the Debian mailing list:

"Allowing 'Open Source AI' to hide their training data is nothing but setting up a 'data barrier' protecting the monopoly, disabling anybody other than the first party to reproduce or replicate an AI. Once passed, OSI is making a historical mistake towards the FOSS ecosystem."

They're not the only ones worried about data. This week TechCrunch noted an August study which "found that many 'open source' models are basically open source in name only. The data required to train the models is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex. Instead of democratizing AI, these 'open source' projects tend to entrench and expand centralized power, the study's authors concluded."

samj shares the concern about training data, arguing that training data is the source code and that this new definition has real-world consequences. (On a personal note, he says it "poses an existential threat to our pAI-OS project at the non-profit Kwaai Open Source Lab I volunteer at, so we've been very active in pushing back [these] past few weeks.") He also obtained a detailed response by asking ChatGPT: what would be the implications of Debian disavowing the OSI's Open Source AI definition? ChatGPT composed a 7-point, 14-paragraph response, concluding that this level of opposition would "create challenges for AI developers regarding licensing. It might also lead to a fragmentation of the open-source community into factions with differing views on how AI should be governed under open-source rules." But "ultimately, it could spur the creation of alternative definitions or movements aimed at maintaining stricter adherence to the traditional tenets of software freedom in the AI age."

However, the official FAQ for the new Open Source AI Definition argues that training data "does not equate to a software source code":

"Training data is important to study modern machine learning systems. But it is not what AI researchers and practitioners necessarily use as part of the preferred form for making modifications to a trained model... [F]orks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data... [W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, like decisions about their health. Similarly, much of the world's Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing."

Read on for the rest of their response... Read more of this story at Slashdot.
https://news.slashdot.org/story/24/11/03/0257241/new-open-source-ai-definition-criticized-for-not-op...