
Breaking through AI data bottlenecks

Tuesday, October 1, 2024, 10:30 AM, from InfoWorld
As AI models become increasingly commoditized, the data required to train and fine-tune them has never been more critical. While procuring high-quality data is expensive and raises privacy concerns, there is a powerful alternative that companies like Google and JPMorgan are exploring: synthetic data. As enterprises move beyond experimenting with general-purpose models to fine-tuning their own specialized AI models, synthetic data is emerging as a key solution to break through common bottlenecks and drive the next wave of innovation.

Here's how synthetic data is addressing three major bottlenecks in specialized machine learning and AI model development.

The data scarcity bottleneck

One of the most significant bottlenecks in training specialized AI models is the scarcity of high-quality, domain-specific data. Building enterprise-grade AI requires ever-larger amounts of diverse, highly contextualized data, which is in short supply. This scarcity, sometimes known as the “cold start” problem, is only growing as companies license their data and further segment the internet. For startups and leading AI teams building state-of-the-art generative AI products for specialized use cases, public data sets offer limited value because they lack specificity and timeliness.

While major players like OpenAI are dredging the internet for any potentially useful data — an approach fraught with consent, copyright, privacy, and quality issues — synthetic data offers a more targeted, secure, and ethical solution. By synthesizing unlimited variations and edge cases based on existing seed data, synthetic data allows organizations to:

Expand limited proprietary data sets, or build on seed examples from expert users, to form a robust foundation for training specialized models (a minimal sketch follows this list).

Create data for rare or “what if” scenarios that may not exist in real-world data sets.

Rapidly iterate and experiment with different data distributions and curations to optimize model performance.
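
To make the seed-expansion idea concrete, here is a minimal sketch in Python, assuming a tiny tabular seed set of numeric features. The columns, values, and the synthesize helper are all hypothetical; production synthetic-data platforms use learned generative models rather than per-column statistics. This only illustrates sampling new variations, and widening the distribution to cover rare edge cases, from a handful of seed rows.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical seed set: a few real rows with two numeric features
# (say, transaction amount and response latency in seconds).
seed = np.array([
    [120.0, 0.8],
    [95.5, 1.1],
    [240.0, 0.6],
])

# Fit trivial per-column statistics to the seed data.
mean, std = seed.mean(axis=0), seed.std(axis=0)

def synthesize(n: int, edge_case_scale: float = 1.0) -> np.ndarray:
    """Sample n synthetic rows around the seed distribution.
    edge_case_scale > 1 widens the sampling to cover rare
    'what if' scenarios absent from the seed set."""
    return rng.normal(mean, std * edge_case_scale, size=(n, seed.shape[1]))

bulk_rows = synthesize(1_000)                    # routine augmentation
edge_rows = synthesize(50, edge_case_scale=3.0)  # rare-scenario coverage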

Synthesizing data not only increases the volume of training data but also enhances its diversity and relevance to specific problems. For instance, financial services companies are already using synthetic data to rapidly augment and diversify real-world training sets for more robust fraud detection — an effort that is supported by financial regulators like the UK’s Financial Conduct Authority. By using synthetic data, these companies can generate simulations of never-before-seen scenarios and gain safe access to proprietary data via digital sandboxes.
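
As a toy illustration of that augmentation pattern, the sketch below generates synthetic fraud examples at a far higher rate than fraud occurs in the wild, yielding a balanced training mix. The schema, value ranges, and 50/50 split are hypothetical; real fraud-detection features are far richer.

import random

random.seed(7)

def make_record(is_fraud: bool) -> dict:
    # The label is attached at creation time, because the generator
    # decides which behavior each record simulates.
    amount = random.uniform(5_000, 20_000) if is_fraud else random.uniform(5, 500)
    return {"amount": round(amount, 2), "label": "fraud" if is_fraud else "legit"}

# Fraud may be well under 1% of real records; synthesize a 50/50 mix instead.
balanced_training_set = [make_record(is_fraud=(i % 2 == 0)) for i in range(10_000)]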

The data quality and management bottleneck

Even when organizations have substantial data, they often face a bottleneck in terms of data quality and organization. This issue manifests in at least three ways:

Data drift and model collapse: Existing training data sets may become outdated or irrelevant over time, causing models to progressively lose their ability to represent the full range of real-world scenarios they must handle.

Incomplete or unbalanced data: Real-world data sets often have gaps or biases that can skew model training.

Lack of proper annotation: Effective model training requires well-labeled data, but manual annotation is time-consuming and prone to biases and inconsistencies.

Synthetic data breaks through this bottleneck by:

Generating high-quality data to fill gaps in existing data and correct for biases.

Creating fully annotated data tailored to specific industry rules or compliance requirements, eliminating the need for manual labeling (a small sketch follows this list).

Allowing for rapid scaling of the data annotation process, significantly reducing time and resource constraints.
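
Here is a small sketch of that "annotation comes free" property, using template-based text synthesis: because the generator inserts each entity itself, every example is born with exact labeled spans and no manual labeling pass is needed. The templates, entity types, and values are hypothetical.

import random

random.seed(3)

TEMPLATES = ["Transfer {amt} to {name}.", "{name} disputed a charge of {amt}."]
NAMES = ["Avery Chen", "Sam Ortiz"]
AMOUNTS = ["$120.00", "$4,500.00"]

def labeled_example() -> dict:
    template = random.choice(TEMPLATES)
    name, amt = random.choice(NAMES), random.choice(AMOUNTS)
    text = template.format(amt=amt, name=name)
    # Spans are exact by construction: we placed the entities ourselves.
    return {
        "text": text,
        "entities": [
            {"label": "PERSON", "start": text.index(name),
             "end": text.index(name) + len(name)},
            {"label": "MONEY", "start": text.index(amt),
             "end": text.index(amt) + len(amt)},
        ],
    }

annotated_set = [labeled_example() for _ in range(1_000)]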

Using synthetic data results in cleaner, more organized data that can dramatically improve model accuracy and efficiency.

The data privacy and security bottleneck

For many organizations, especially those in highly regulated industries, data privacy and security concerns create a significant bottleneck in AI development. Stringent privacy standards and tightening regulations, such as the GDPR and EU AI Act, restrict the amount of valuable data that is usable for AI initiatives.

Synthetic data, when combined with modern privacy-preserving techniques like differential privacy, shatters this bottleneck by serving as a secure interface to access rich data insights without compromising individual privacy. This approach, sketched in code after the list below, allows organizations to:

Leverage sensitive data that would otherwise be off-limits for AI training.

Safely share and collaborate on data-driven projects across departments, between organizations, and in the public, open community.

Comply with stringent data protection regulations and respect consumer privacy, while still advancing applied science and innovating with AI.
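
To give a flavor of one privacy-preserving building block, the sketch below applies the Laplace mechanism from differential privacy to a simple statistic before it would seed any downstream generation. This is a toy example under stated assumptions (bounded values, a hypothetical epsilon); production systems typically apply differential privacy during model training itself rather than to individual statistics.

import numpy as np

rng = np.random.default_rng(0)

def dp_mean(values: np.ndarray, lower: float, upper: float,
            epsilon: float) -> float:
    """Differentially private mean of bounded values via the Laplace
    mechanism. Changing one record moves the mean by at most
    (upper - lower) / n, so noise is calibrated to that sensitivity."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

ages = np.array([34, 41, 29, 57, 62, 38])  # hypothetical sensitive records
private_average = dp_mean(ages, lower=18, upper=90, epsilon=1.0)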

In the healthcare sector, synthetic data is enabling companies to safely anonymize and operationalize data from electronic health records and transcripts, powering use cases from analytics to customized LLM training sets without compromising patient privacy.

The path forward: Synthesized data

By breaking through these critical bottlenecks, synthetic data is democratizing access to AI innovation and enabling the development of highly specialized, sustainable AI models that were previously out of reach for many organizations.

As we move forward, the quality, relevance, and ethical use of training data will increasingly determine the success of AI initiatives. It’s no longer just about how sophisticated your model is, but how good your data is.

Synthetically designed data is cleaner, more customizable, less biased, and faster to produce than traditionally collected real-world data. It opens up new possibilities for safe data collaborations and AI development that will benefit startups, scientists and researchers, global brands, and governments alike.

As AI continues to evolve, the role of synthetic data in breaking through bottlenecks and enabling agile and iterative model training will only grow in importance. Organizations that embrace this technology now will be well-positioned to lead in the AI-driven future.

Alex Watson is co-founder and chief product officer at Gretel.



Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.