Will the non-English genAI problem lead to data transparency and lower costs?
Wednesday, February 12, 2025, 11:00 AM, from ComputerWorld
It’s become increasingly clear that the quality of large language models (LLMs) plunges when they move from English to non-English text. They’re less accurate, and there’s a serious lack of transparency around their training data, in terms of both volume and quality.
The latter has long been a problem for generative AI (genAI) tools and platforms. But enterprises aren’t paying less for less-productive models, even though the value those models offer is diminished. So, why aren’t CIOs getting a price break for non-English models? Because without any data transparency, they rarely know they’re paying more for less.

There are a variety of reasons why model makers don’t disclose the particulars of their training data. (Let’s not even get into whether they had the legal right to do that training in the first place, though it’s tempting, if only to explore the hypocrisy of OpenAI complaining that DeepSeek trained on much of its data without permission.)

Speaking of DeepSeek, don’t read too much into the lower cost of its underlying models. Yes, its builders cleverly leveraged open source to find efficiencies and lower pricing, but there has been little disclosure of how much the Chinese government helped fund DeepSeek, directly or indirectly. That said, if DeepSeek is the cudgel that puts downward pressure on genAI pricing, I’m all for it, and IT execs should be, too. But until we see evidence of meaningful price cuts, they should use the lack of data transparency in non-English models to push model makers’ price tags down from the stratosphere.

The non-English issue isn’t really about the language per se. It’s more about the training data available in that language. (By some estimates, the training datasets for non-English models could be just 1/10 or even 1/100 the size of their English counterparts.)

Hans Florian, a distinguished research scientist for multilingual natural language processing at IBM, said he uses a trick to guesstimate how much data is available in various languages: “You can look at the number of Wikipedia pages in that language. That correlates quite well with the amount of data available in that language,” he said. (A sketch of that heuristic appears below.)

To further complicate the issue, sometimes it’s not about the language or the available data in that language. It can, logically enough, be about data related to activities in the region where a particular language is dominant.

If model makers start seeing meaningful pricing pushback from a lot of enterprises concerned about model quality, they have only a couple of options. They can selectively (and secretly) negotiate lower prices on non-English models for some of their customers, or they can get serious about data transparency.

Because LLM makers have invested billions of dollars in genAI, they aren’t going to like the idea of lower pricing. That leads to the second option: deliver full transparency to all customers about all models, in terms of both data quantity and quality, and price their wares accordingly. Given that quality is almost impossible to represent numerically, that will mean disclosing all training data details so each customer can make its own determination of quality for the topics, verticals and geographies it cares about.

The pricing disparity between what a model can deliver and what an enterprise is forced to pay is at the heart of why CIOs are still struggling to deliver genAI ROI. Obviously, lower pricing would be the best way to improve the ROI of genAI investments. But if that’s not going to happen anytime soon, full data transparency is the next best thing.
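Florian’s Wikipedia heuristic is simple enough to automate. Here is a minimal sketch in Python, assuming the standard MediaWiki siteinfo API; the specific language codes and the comparison against English are illustrative choices, not from the article.

```python
import requests

# Rough proxy for per-language training-data availability, per Florian's
# heuristic: compare article counts across Wikipedia language editions.
# The language list below is an illustrative assumption.
LANGS = ["en", "de", "sw", "is"]  # English, German, Swahili, Icelandic

def article_count(lang: str) -> int:
    """Fetch the article count for one Wikipedia language edition."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "statistics",
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["query"]["statistics"]["articles"]

if __name__ == "__main__":
    counts = {lang: article_count(lang) for lang in LANGS}
    baseline = counts["en"]
    for lang, n in counts.items():
        # The ratio against English gives a feel for the 1/10 to 1/100
        # data gap cited above for non-English models.
        print(f"{lang}: {n:,} articles ({n / baseline:.2%} of English)")
```

Ratios like these are only a proxy, but they make the scale of the gap concrete before any pricing negotiation.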
There is a catch: model makers almost certainly realize that full data-training transparency would force them to lower prices, because it would showcase how low-quality their data is. Note that I say their data is low-quality as if it’s a given; it absolutely is. If model makers believed they were using lots of high-quality data, far from resisting transparency, they would embrace it. It would be a selling point. It might even be useful for propping up prices; high quality usually sells itself. Their refusal to deliver any kind of data-training transparency tells you everything you need to know about their quality beliefs, and about the state of the market at the moment.
https://www.computerworld.com/article/3822069/will-the-non-english-genai-problem-lead-to-data-transp...