Bridging the performance gap in data infrastructure for AI
Monday, October 28, 2024, 10:00 AM, from InfoWorld
In the current technology landscape, organizations are looking to AI to provide transformative product differentiation and groundbreaking new revenue streams. In 2023, large language models (LLMs) dazzled folks with the possibility of new capabilities, features, and products. In 2024 and beyond, we’re now focused on the reality of bringing those ideas to fruition and the challenges of what that means for data infrastructure. For most, the road to AI success is not smooth, as organizations find their legacy data ecosystem will not suffice for data processing today, let alone tomorrow.
As the need for data as a differentiator builds, organizations are grappling with the daunting task of modernizing their infrastructure and phasing out legacy systems, while concurrently delivering traditional analytics without interruption. Yet delivering new value through data is pivotal for augmenting AI capabilities and maintaining a competitive edge. A significant chasm exists between an organization’s current infrastructure capabilities and the requirements necessary to effectively support AI workloads, manifesting most prominently in the realm of performance. Despite the seemingly Herculean effort required to build an AI program, there are key considerations that data infrastructure architects should focus on to move forward. By understanding the limitations of legacy data infrastructure, architects can unlock new capabilities by building flexible, scalable, performance-focused systems that streamline the path to value for data.

A look at legacy data infrastructure

To understand the data ecosystem that has prevailed over the past decade, let’s start by thinking about what’s been built thus far. Early data infrastructure was a tightly coupled system of data storage and compute, with the compute layer built specifically to interact with the storage layer. The storage layer was composed of physical servers, often in dedicated on-premises data centers, and media included hard disk drives, magnetic tapes, and optical disks. This storage was typically organized into hierarchical file systems or relational databases.

Classic compute environments in early data infrastructures relied heavily on mainframes and minicomputers, transitioning to client-server architectures by the 1990s. Compute tasks were performed by dedicated hardware, with multi-core processors and early virtualization technologies improving efficiency and resource utilization. Manual SQL queries and programmatic access via ODBC/JDBC were common for data interaction, while ETL processes moved data between operational systems and data warehouses. As technology progressed, the integration of distributed computing and early cloud services began to reshape these environments, paving the way for the scalable, flexible compute infrastructures we rely on today.

The classic data pipeline

Alongside this legacy data infrastructure, data pipelines were built: raw data is moved from a source operational or transactional system (e.g., ATMs generating transaction data) to an analytics-focused system via ETL or ELT, into either a data warehouse or a data lake (or a data lakehouse for those wanting the “best of both worlds”). Either way, data needs to be transformed from a physical, technical format into a more business-friendly, value-driven schema for consumption by downstream users. These transformations typically involve cleaning, aggregating, and joining data sets to produce curated, productized data sets that serve as valuable business assets.

The path to value for data was focused on moving from a transaction-focused system to a data- and analytics-focused system and transforming data along the way. This path to value has remained largely unchanged despite advances in the underlying technologies, such as the move from mainframes to x86, hard disk drives to flash, and so on.
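As a rough illustration of the transformation step described above, here is a minimal batch ETL sketch in Python using pandas. The file paths, table names, and columns are hypothetical placeholders, not from the article; a real pipeline would read from and write to your actual operational systems and analytics layer.

```python
# Minimal batch ETL sketch: raw transactions -> cleaned, aggregated, business-friendly table.
# File names, columns, and paths are hypothetical placeholders.
import pandas as pd

# Extract: pull raw transaction records exported from the operational system.
transactions = pd.read_csv("raw/atm_transactions.csv", parse_dates=["txn_time"])
branches = pd.read_csv("raw/branch_dim.csv")  # small dimension table

# Transform: clean obviously bad rows, then aggregate to a daily, per-branch grain.
clean = transactions.dropna(subset=["account_id", "amount"])
clean = clean[clean["amount"] > 0]

daily = (
    clean.assign(txn_date=clean["txn_time"].dt.date)
         .groupby(["branch_id", "txn_date"], as_index=False)
         .agg(total_amount=("amount", "sum"), txn_count=("amount", "count"))
)

# Join with the branch dimension to produce a business-friendly, consumable data set.
curated = daily.merge(branches, on="branch_id", how="left")

# Load: write the curated table to the analytics layer (columnar format for the warehouse or lake).
curated.to_parquet("curated/daily_branch_activity.parquet", index=False)
```

The shape of the job, extract from a transactional source, clean and aggregate, join to business context, then land in an analytics store, is the same whether it is implemented in SQL, Spark, or a managed ETL service.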
Consider the cloud

One of the most important evolutionary moments in the history of computing was the introduction of cloud computing. By commoditizing hardware, the cloud allowed traditional data architectures to be effectively lifted and shifted into “someone else’s data center,” allowing more flexibility and changing the face of infrastructure considerations in technical architecture. Organizations were able to offload the management of physical infrastructure to cloud providers, and this shift enabled businesses to focus more on data processing and analytics while leveraging the cloud’s capabilities for storage, compute, and advanced services like AI and machine learning.

However, the onset of cloud computing did not fundamentally change the way data pipelines were built. The path to value for data remains the same regardless of whether it runs in a data center or the cloud. This isn’t to say that cloud computing hasn’t advanced data processing in performance, scalability, compute capabilities, or other major ways. However, the data pipeline from source to consumption has remained largely unchanged. One could argue that the so-called “modern data stack” is simply a modularized, SaaS- and cloud-based version of the same legacy architecture that’s been around for decades.

Enter AI

Although artificial intelligence has been around for years in the form of machine learning algorithms, it’s important to recognize that the latest advances in AI are dramatically different from this traditional approach to data science.

The scale of unstructured data

Modern AI systems process vast amounts of unstructured data, requiring scalable infrastructure to handle the increased volume and complexity. Thus far, data infrastructure has focused on structured data, but contemporary data collections are up to 95% unstructured. This means that systems built for terabytes now need to accommodate petabytes and exabytes of data, which forces hard conversations about architectures. For example, a cloud- and SaaS-optimized data ecosystem may be optimal for business intelligence (BI) and traditional machine learning but will lack the capabilities to deal with unstructured data. Further, scaling such a system for AI workloads could be cost-prohibitive or fall short of the performance those workloads require.

The performance gap

The performance requirements for advanced AI models have driven the adoption of GPUs and specialized hardware, dramatically changing infrastructure needs. This shift has allowed for faster training and inference times, enabling businesses to leverage AI for real-time analytics, enhanced decision-making, and innovative applications previously unattainable with traditional data science methods. However, the onset of rapidly advancing AI technologies, such as retrieval-augmented generation (RAG) and generative AI models, has intensified the need for high performance. This demands not only superior processing power but also an agile infrastructure capable of evolving with the pace of AI development. Businesses now face the challenge of maintaining cutting-edge hardware and optimizing their data pipelines to ensure AI models perform efficiently and effectively. Keeping up with these new developments in infrastructure is paramount, as falling behind can mean missing out on the competitive advantages that advanced AI promises. This performance gap underscores the critical need for continuous innovation and investment in AI-specific infrastructure to fully harness the transformative potential of modern AI technologies.
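One way to see why scale and performance are inseparable is a back-of-envelope calculation of how long a single full pass over a data set takes at different aggregate read throughputs. All of the volume and throughput figures below are illustrative assumptions, not from the article.

```python
# Rough arithmetic: time to scan a data set once at a fixed aggregate read throughput.
# Volume and throughput figures are illustrative assumptions only.
def scan_hours(volume_tb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to read the full volume once at the given aggregate throughput."""
    return (volume_tb * 1000) / throughput_gb_per_s / 3600

for volume_tb, label in [(10, "10 TB (BI warehouse)"),
                         (1_000, "1 PB (unstructured corpus)"),
                         (100_000, "100 PB (0.1 EB estate)")]:
    print(f"{label:30s} ~{scan_hours(volume_tb, 10):7.1f} h at 10 GB/s, "
          f"~{scan_hours(volume_tb, 500):6.1f} h at 500 GB/s")
```

At an assumed 10 GB/s, a 1 PB corpus takes roughly 28 hours to read once; the same sweep at an assumed 500 GB/s takes about half an hour. That gap between hours and minutes per pass is the kind of difference that separates a feasible AI training or retrieval workload from one that never leaves the backlog.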
The AI data pipeline

It’s important to note that the process of transforming data described above is what makes data valuable to an organization; it’s the “secret sauce” that applies business-specific logic to the data and ultimately makes it a valuable asset. This application of business logic is essential to BI, machine learning, and AI alike. In traditional data systems, this transformation process typically involves structuring data, cleaning it, and aggregating it to produce actionable insights. However, with the advent of new paradigms such as generative AI, the requirements have become significantly more complex and demanding, building on top of the traditional data pipeline. Generative AI shakes things up with the inclusion of unstructured data, such as text, images, and audio, which introduces new challenges in data processing and integration. In addition to the technical requirements to accommodate scale and performance, the specific use cases of generative AI—such as real-time content generation, dynamic personalization, and complex decision-making—demand that data pipelines be exceptionally agile and capable of integrating insights from various data sources swiftly.

What’s more, while legacy data pipelines focus on a forward movement of data from source to processing to target, AI pipelines are more cyclical: data can be used and then fed back into the system to improve algorithmic output. This additional complexity, along with the diversity of multi-modal data in the pipeline, means that flexible scalability and performance become absolutely critical to AI, let alone to whatever comes next for the use cases of data intelligence. Therefore, important considerations for building data pipelines include not just scaling the infrastructure but also rethinking the design of the pipelines themselves, to ensure they can support the rapid iteration and deployment of evolving AI models. Effective management of these pipelines is crucial for maintaining high performance and achieving the desired outcomes from AI initiatives, making it a key focus for organizations aiming to leverage the full potential of modern AI technologies.
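To make the cyclical shape of such a pipeline concrete, here is a highly simplified retrieval-augmented generation loop in Python. The embed and generate functions and the in-memory vector store are hypothetical stand-ins for a real embedding model, LLM, and vector database; the point is the flow of data, including the feedback step at the end.

```python
# Simplified sketch of a cyclical RAG-style pipeline.
# embed(), generate(), and the in-memory VectorStore are hypothetical stand-ins
# for a real embedding model, LLM, and vector database.
from dataclasses import dataclass, field

@dataclass
class VectorStore:
    docs: list = field(default_factory=list)  # list of (vector, text) pairs

    def add(self, vector, text):
        self.docs.append((vector, text))

    def search(self, query_vec, k=3):
        # Toy similarity: dot product against stored vectors.
        scored = sorted(self.docs,
                        key=lambda d: -sum(a * b for a, b in zip(query_vec, d[0])))
        return [text for _, text in scored[:k]]

def embed(text):
    # Stand-in for a real embedding model: hash characters into a tiny vector.
    vec = [0.0] * 8
    for i, ch in enumerate(text):
        vec[i % 8] += ord(ch) / 1000
    return vec

def generate(prompt, context):
    # Stand-in for an LLM call.
    return f"Answer to '{prompt}' using {len(context)} retrieved snippets."

store = VectorStore()
for snippet in ["Q3 revenue grew 12%.", "Churn fell in EMEA.", "New branch opened."]:
    store.add(embed(snippet), snippet)        # ingest: unstructured data in

question = "How did revenue trend last quarter?"
context = store.search(embed(question))      # retrieve
answer = generate(question, context)         # generate

# Feedback loop: the generated output is itself indexed so later queries can reuse it,
# illustrating the cyclical (rather than strictly forward) flow of AI pipelines.
store.add(embed(answer), answer)
print(answer)
```

The final two lines are what distinguish this from a classic source-to-target pipeline: output flows back into the same store that feeds future retrieval, so the pipeline has to be designed for continuous re-ingestion rather than one-way movement.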
A forward-looking data architecture for AI

How do we build performant, scalable, flexible, and cost-efficient data pipelines, given all of the above considerations? It can be daunting. For example, as new compute capabilities come online, it’s important to have the ability to upgrade to new hardware in a highly reliable environment. Modern data architectures must be designed with the flexibility and scalability to seamlessly integrate cutting-edge hardware and software advancements as they emerge. This includes adopting modular and containerized approaches that allow for the quick deployment of new technologies without significant downtime or disruption to existing workflows.

One example, the VAST Data Platform, offers unified storage, database, and data-driven function engine services built for AI, enabling seamless access and retrieval of data essential for AI model development and training. With enterprise-grade security and compliance features, the platform can capture, catalog, refine, enrich, and preserve data through real-time deep data analysis and learning to ensure optimal resource utilization for faster processing, maximizing the efficiency and speed of AI workflows across all stages of a data pipeline.

Hybrid and multicloud strategies

It can be tempting to pick a single hyperscaler and use the cloud-based architecture they provide, effectively “throwing money at the problem.” Yet, to achieve the level of adaptability and performance required to build and grow an AI program, many organizations are choosing to embrace hybrid and multicloud strategies. By leveraging a combination of on-premises, private cloud, and public cloud resources, businesses can optimize their infrastructure to meet specific performance and cost requirements, while gaining the flexibility required to deliver value from data as fast as the market demands it. This approach ensures that sensitive data can be securely processed on-premises while taking advantage of the scalability and advanced services offered by public cloud providers for AI workloads, thus maintaining high compute performance and efficient data processing.

Embracing edge computing

As AI applications increasingly demand real-time processing and low-latency responses, incorporating edge computing into the data architecture is becoming essential. By processing data closer to the source, edge computing reduces latency and bandwidth usage, enabling faster decision-making and improved user experiences. This is particularly relevant for IoT and other applications where immediate insights are critical, ensuring that the performance of the AI pipeline remains high even in distributed environments.

Data governance and security

In a forward-looking AI architecture, robust data governance and security are more important than ever. With the increasing volume and complexity of data, ensuring data integrity, privacy, and compliance with constantly evolving government regulations becomes even more critical. Implementing comprehensive data governance frameworks and leveraging AI-driven security solutions can help protect sensitive information and maintain trust with stakeholders, which is essential for maintaining the overall performance and reliability of the AI data pipeline.

Integration of AI and machine learning workflows

A forward-looking data architecture should also facilitate the seamless integration of AI and machine learning workflows to maximize performance. This involves creating pipelines that support the entire data life cycle, from ingestion and preprocessing to model training, deployment, and monitoring. Utilizing modern devops strategies, like containerization and infrastructure as code, alongside MLops platforms for continuous delivery of models can significantly enhance operational efficiency and model performance. Streamlining these processes ensures that the data pipeline is optimized to swiftly and efficiently get data to the consumption layer, reducing bottlenecks and improving the speed and accuracy of AI insights.
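As a minimal, platform-agnostic sketch of what supporting that life cycle can look like in code, the pipeline below chains the stages named above behind a quality gate before deployment. All stage bodies and the threshold are hypothetical placeholders, not tied to any specific MLops product; a real pipeline would call your data platform, training framework, and serving infrastructure at each step.

```python
# Skeleton of a model-delivery pipeline covering the life cycle stages:
# ingestion -> preprocessing -> training -> evaluation -> (gated) deployment -> monitoring.
# All stage bodies are hypothetical placeholders.

def ingest():
    return [{"feature": i, "label": i % 2} for i in range(100)]   # placeholder data pull

def preprocess(rows):
    return [r for r in rows if r["feature"] is not None]          # placeholder cleaning

def train(rows):
    return {"name": "demo-model", "trained_on": len(rows)}        # placeholder training

def evaluate(model):
    return 0.91                                                   # placeholder metric

def deploy(model):
    print(f"Deploying {model['name']} trained on {model['trained_on']} rows")

def monitor(model):
    print(f"Monitoring {model['name']} for drift and latency")

def run_pipeline(quality_gate=0.85):
    model = train(preprocess(ingest()))
    score = evaluate(model)
    # Continuous delivery: only promote the model if it clears the quality gate.
    if score >= quality_gate:
        deploy(model)
        monitor(model)
    else:
        print(f"Model held back: score {score:.2f} below gate {quality_gate:.2f}")

if __name__ == "__main__":
    run_pipeline()
```

Wrapping each stage in a container and declaring the surrounding infrastructure as code is what lets this kind of pipeline rerun automatically as data, models, and hardware change.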
Investment in talent and skills

Finally, a forward-looking AI architecture requires a significant investment in talent and skills. Organizations must prioritize hiring and training data and IT professionals who are well-versed in the latest AI technologies and best practices. Cultivating a culture of continuous learning and innovation will ensure that the organization remains at the forefront of AI advancements and can effectively leverage new opportunities as they arise, ultimately enhancing the performance of AI systems and infrastructure.

By adopting a forward-looking data architecture focused on performance, businesses can position themselves to fully capitalize on the transformative potential of AI. Taking a proactive approach to AI infrastructure ensures organizations remain at the forefront of technological innovation, enabling them to unlock the full potential of their data and achieve their strategic objectives in an increasingly competitive landscape. This strategy drives innovation, efficiency, and competitive advantage in an increasingly data-driven world, effectively bridging the performance gap in AI infrastructure.

Colleen Tartow, PhD, is field CTO and head of strategy at VAST Data.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
https://www.infoworld.com/article/3578648/bridging-the-performance-gap-in-data-infrastructure-for-ai...