Why your AI models stumble before the finish line

Tuesday, November 12, 2024, 10:00 AM, from InfoWorld

In 2023, enterprises across industries invested heavily in generative AI proofs of concept (POCs), eager to explore the technology’s potential. Fast-forward to 2024, and companies face a new challenge: moving AI initiatives from prototype to production.

According to Gartner, by 2025, at least 30% of generative AI projects will be abandoned after the POC stage. The reasons? Poor data quality, governance gaps, and the absence of clear business value. As companies try to move models from prototype to production, they are realizing that the primary challenge isn’t simply building models. The biggest roadblock is curating the right data and ensuring the quality of what feeds those models.

More data isn’t always better

In the early days of AI development, the prevailing belief was that more data leads to better results. However, as AI systems have become more sophisticated, the importance of data quality has surpassed that of quantity. There are several reasons for this shift. First, large data sets are often riddled with errors, inconsistencies, and biases that can silently skew model outcomes. With an excess of data, it becomes difficult to control what the model learns, which can lead it to overfit the training set and perform worse on new data. Second, the “majority concept” within the data set tends to dominate the training process, diluting insights from minority concepts and reducing model generalization. Third, processing massive data sets slows down iteration cycles, so critical decisions take longer as data quantity increases. Finally, processing large data sets can be costly, especially for smaller organizations or startups.

Organizations must strike a delicate balance between having enough data to train robust models and ensuring that it’s the right data. This means moving beyond data accumulation and focusing on data quality. By investing in practices like cleaning, validation, and enrichment, companies can ensure that their AI models are not only built on a solid foundation of high-quality data but are also well-prepared to scale and perform effectively in real-world production environments.

The price of poor data quality

IBM has estimated that poor data quality costs the United States economy around $3.1 trillion annually. Across industries, poor data quality is a common root cause of AI initiatives stalling after proof of concept, draining resources and blocking companies from achieving full production-scale AI.

Beyond direct financial losses, failed AI projects incur significant indirect costs, including wasted time and computational resources. Most critically, these failures represent missed opportunities for a competitive advantage and can damage both internal and external reputations. Repeated failures can create a culture of risk aversion, stifling the very innovation that AI promises to deliver.

Research indicates that data scientists spend approximately 80% of their time preparing and organizing data before they can conduct any meaningful analysis.

The key characteristics of high-quality data

To overcome the root challenge of poor data quality, high-performance AI data sets must exhibit five key characteristics:

Accuracy in reflecting real-world scenarios,

Consistency in format and structure,

Diversity to enhance adaptability,

Relevance to specific objectives, and

Ethical considerations in data collection and labeling.

To illustrate the importance of these characteristics, consider an example from Automotus, a company that automates payments for vehicle unloading and parking. The company faced challenges with poor data quality, including duplicate and corrupt images, which hindered its ability to convert vast amounts of image data into labeled training data sets for its AI models. To address these issues, the company used data quality tools to curate and reduce its data set by removing the bad examples, achieving a 20% improvement in mean average precision (mAP) for its object detection models. The data reduction not only enhanced model accuracy but also led to a 33% reduction in labeling costs, demonstrating that investing in data quality can yield both performance improvements and economic benefits.
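As a rough illustration of what removing duplicate and corrupt images can look like, here is a minimal Python sketch. It is not the tooling Automotus used. It assumes that "duplicate" means byte-for-byte identical and that "corrupt" means empty or lacking a JPEG or PNG file signature; the function name curate_images and the file names are hypothetical.

```python
import hashlib

# Magic-byte prefixes for two common image formats (JPEG and PNG).
# Assumption: any file that does not start with one of these is "corrupt".
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def curate_images(images):
    """Drop exact duplicates and obviously corrupt files.

    `images` maps a (hypothetical) file name to its raw bytes.
    Returns (kept_names, removed), where `removed` maps name -> reason.
    """
    first_seen = {}  # content hash -> name of the first file with that content
    kept, removed = [], {}
    for name, data in images.items():
        # Empty files and files without a known signature are treated as corrupt.
        if not data.startswith(IMAGE_SIGNATURES):
            removed[name] = "corrupt"
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            removed[name] = f"duplicate of {first_seen[digest]}"
            continue
        first_seen[digest] = name
        kept.append(name)
    return kept, removed


if __name__ == "__main__":
    sample = {
        "a.jpg": b"\xff\xd8\xff" + b"frame-1",
        "b.jpg": b"\xff\xd8\xff" + b"frame-1",  # same bytes as a.jpg
        "c.png": b"",                            # empty file
    }
    kept, removed = curate_images(sample)
    print(kept)
    print(removed)
```

Exact hashing only catches identical files. Near-duplicates, such as consecutive video frames, would need a similarity measure like perceptual hashing or embedding distance, which this sketch does not attempt.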

How to achieve high-quality data

To navigate the challenges of AI development, organizations should take the following concrete steps to enhance their data practices:

Establish clear data governance policies: Organizations should create comprehensive data governance policies that outline roles, responsibilities, and standards for data management. These guidelines ensure uniform data quality throughout the organization, reducing the risk of poor data impacting decision-making.

Implement rigorous data cleaning techniques: Employ techniques such as outlier detection, imputation for missing values, and normalization to maintain the integrity of data sets. These practices help ensure that the data used for AI models is accurate and reliable.

Invest in accurate labeling processes: High-quality labels are essential for model precision. Automated data labeling can offer significant advantages over manual labeling by reducing costs and streamlining the process. However, a hybrid approach that combines automated tools with human oversight can enhance accuracy by leveraging the strengths of both methods.

Source data from diverse and reliable sources: Companies should seek diverse data sources to reduce bias and improve model performance. Examples include public data sets, industry-specific databases, and third-party data providers. Ensuring these sources are credible is crucial for maintaining data quality.

Leverage advanced data management tools: To sustain AI performance over time, use data management tools to continuously curate and update training data sets. Data distributions can shift over time in production environments, and these tools can help companies adapt their data sets accordingly.
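The cleaning techniques named in the steps above (imputation for missing values, outlier detection, and normalization) can be sketched for a single numeric column as follows. This is a minimal example using only the Python standard library. The specific choices are assumptions made for illustration: median imputation, the 1.5 x IQR outlier rule, and min-max scaling. Real pipelines would choose methods to suit the data.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = [10, 12, None, 11, 13, 500, 12]  # one missing value, one extreme value
    cleaned = remove_outliers_iqr(impute_median(raw))
    print(min_max_normalize(cleaned))
```

The order matters. Normalizing before removing the extreme value would compress every other reading toward zero, so this sketch drops outliers first.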

Elevate data quality to scale AI

The demand for high-quality data will only grow as AI adoption increases. Gartner predicts that by 2025, enterprises will process 75% of their data outside traditional data centers or the cloud, highlighting the need for new strategies to maintain data quality in distributed environments. To confront these obstacles, key innovations are emerging in the field of data quality, including automated data checks, machine learning for data cleaning, privacy-preserving methods for training models on distributed data, and the generation of synthetic data to enhance real data sets.
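One simple form an automated data check can take is a comparison between the label mix of a reference data set and that of a recent production batch. The Python sketch below uses total variation distance for that comparison. The function names and the 0.1 alert threshold are hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute differences between two label distributions.

    0.0 means identical mixes; 1.0 means no overlap at all.
    """
    p, q = label_distribution(reference), label_distribution(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def check_drift(reference, current, threshold=0.1):
    """Flag a batch whose label mix differs from the reference by more than
    `threshold` (an arbitrary cut-off assumed for this example)."""
    distance = total_variation_distance(reference, current)
    return {"distance": distance, "drifted": distance > threshold}


if __name__ == "__main__":
    training_labels = ["car"] * 8 + ["truck"] * 2
    production_labels = ["car"] * 5 + ["truck"] * 5
    print(check_drift(training_labels, production_labels))
```

A check like this could run on a schedule against each new batch. A flagged batch would then prompt a review of whether the training set needs to be re-curated.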

These advancements are making it possible, and increasingly practical, for every company to create a data-centric culture. By prioritizing data quality, companies aren’t merely avoiding pitfalls; they are unlocking AI’s full potential and setting new industry standards. It’s time to rally around the power of quality data, not just for competitive advantage, but to elevate the entire AI ecosystem. As AI continues to mature, the question isn’t “Do we have enough data?” Instead, it’s time to ask, “Do we have the right data to power the AI solutions of tomorrow?”

Ulrik Stig Hansen is co-founder and president of Encord, an AI data development platform built to help companies manage and prepare their data for AI.

—

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.