Why Apache Iceberg is on fire right now

Wednesday, July 31, 2024, 10:30 AM, from InfoWorld

Enterprises have come to view their data as a vital asset, and like other vital assets, they want control over that data to extract its maximum value. That means having the ability to leverage the multitude of tools and frameworks that have emerged in recent years to support AI and analytics, and that requires standards that are truly open.

A lot of enterprise data now lives in data lakes, which are ideal for storing vast amounts of structured and unstructured data. Data lakes provide organizations with a comprehensive way to explore, refine, and analyze petabytes of information that may be arriving constantly from multiple sources.

But data lakes that depend on proprietary formats make it impossible — or at least cost-prohibitive — to share and access data across different platforms and tools. Large enterprises often use multiple data platforms and processing engines, and data teams need a secure way to access data across these environments without the high costs, risks, and technical debt associated with making additional copies.

That’s why Apache Iceberg has become one of the hottest open source projects: it provides an open table format that enables interoperability across data lakes. Apache Iceberg shows the importance of a truly open standard, and it’s a model the industry should follow for data catalogs and other important parts of the data infrastructure stack.

The rise of Apache Iceberg

Apache Iceberg prescribes a standard for the metadata that defines a table: its schema, its history, and each file that makes up the table. It also supports ACID transactions, allowing multiple applications to safely work on the same data simultaneously.
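To make the idea of table metadata concrete, here is a minimal sketch in Python. It is a toy, in-memory model and not the Iceberg specification or any Iceberg library API: the class names (Snapshot, TableMetadata), the method names, and the file paths are all hypothetical, chosen only to show how a schema, a history of snapshots, and a list of data files can together describe a table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One point-in-time version of the table (hypothetical, simplified)."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]  # every file that makes up this version


@dataclass
class TableMetadata:
    """Toy table metadata: a schema plus the table's snapshot history."""
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def files_at(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Return the data files for a snapshot (default: the current one)."""
        wanted = self.current_snapshot_id if snapshot_id is None else snapshot_id
        for snap in self.snapshots:
            if snap.snapshot_id == wanted:
                return list(snap.data_files)
        return []

    def append_files(self, new_files: List[str]) -> Snapshot:
        """Record a new snapshot; older snapshots are kept as history."""
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            parent_id=self.current_snapshot_id,
            data_files=tuple(self.files_at()) + tuple(new_files),
        )
        self.snapshots.append(snap)
        self.current_snapshot_id = snap.snapshot_id
        return snap


# Hypothetical table and file paths, for illustration only.
table = TableMetadata(schema={"order_id": "long", "amount": "double"})
table.append_files(["s3://example-bucket/orders/file-a.parquet"])
table.append_files(["s3://example-bucket/orders/file-b.parquet"])

print(table.files_at())   # files in the current version of the table
print(table.files_at(1))  # files as of the first snapshot
```

Because the data files are only referenced, never rewritten, any engine that understands the metadata could in principle read the same files, and earlier versions of the table remain addressable through their snapshots.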

With Iceberg, companies no longer need to copy their data to use it with different processing engines and tools. Iceberg cleanly separates the data itself from the layer that manages it. That allows organizations to take advantage of low-cost cloud storage, for example, and plug in any Iceberg-compatible processing engine or other tool.

What’s important here is that Iceberg is truly open, because it’s managed by the Apache Software Foundation. No single person or organization controls its features. Instead, it’s governed by a coalition of contributors who agree on which features will be added next and who ensure together that it remains interoperable and optimized across all of their products.

What it means to be open

Apache Iceberg highlights the critical difference between open code and open governance. A vendor can put its code on GitHub and make it available for others to use as open source software, but the project’s direction and features are still governed by a single company. This “benevolent dictator” approach does not ensure interoperability, which can leave customers unable to move data between platforms and therefore without control of their own data assets.

This is out of step with the reality at large businesses today. Most enterprises are using multiple hyperscalers, data platforms, and processing engines, accumulated over the years through a combination of acquisitions and architectural decisions that made sense at the time. Without open standards to ensure interoperability, enterprises have to shoulder the expense and technical debt of copying data across multiple platforms. Besides being extremely costly, copies of data rarely stay in sync for long, which means analysts end up building reports in multiple platforms that give different answers about the business.

While Iceberg solves this problem, open standards are needed in other areas as well. We’re now seeing a new battlefield emerging in the area of data catalogs, which play a critical role in a multi-engine architecture. Catalogs make operations on tables reliable by supporting atomic transactions. This means that data engineers and the pipelines they build can modify tables concurrently, and queries on these tables produce accurate results. To accomplish this, all Iceberg-table read and write operations, even from different engines, are routed through a catalog.
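One common way to get this kind of atomicity is a compare-and-swap on a pointer to the table's current metadata. The Python sketch below is a toy illustration under that assumption, not the API of any real catalog: InMemoryCatalog, CommitConflict, the table name, and the metadata file names are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when a writer's view of the table is stale (hypothetical)."""


class InMemoryCatalog:
    """Toy catalog: maps each table name to its current metadata location."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} changed since {expected!r}")
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.commit("sales.orders", None, "metadata/v1.json")  # create the table

# Two writers, possibly in different engines, both start from v1.
base_a = catalog.current("sales.orders")
base_b = catalog.current("sales.orders")

# Writer A commits first and succeeds.
catalog.commit("sales.orders", base_a, "metadata/v2-a.json")

# Writer B's commit is rejected because its base is now stale.
try:
    catalog.commit("sales.orders", base_b, "metadata/v2-b.json")
    writer_b_first_try = True
except CommitConflict:
    writer_b_first_try = False
    # B re-reads the current state, reapplies its change, and retries.
    fresh = catalog.current("sales.orders")
    catalog.commit("sales.orders", fresh, "metadata/v3-b.json")

print(writer_b_first_try, catalog.current("sales.orders"))
```

Readers always see either the old pointer or the new one, never a half-applied change, which is why routing every engine through the same catalog matters.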

SaaS providers and hyperscalers can use the catalog as a way to create customer stickiness, but enterprises are getting wise to this. They understand that, just as Iceberg provides a common format for tables, an open catalog standard will let them choose the best tool for the job and maximize the value of their data.

Open standards are good for business, good for customers, and good for the wider ecosystem. Enterprises have complex data architectures, and open standards allow them to use data across these platforms without incurring additional cost and governance challenges. Open standards also promote innovation, because they force companies to compete on implementation and allow customers to choose among them.

Ensuring a customer-centric data ecosystem

The rapid growth of Apache Iceberg underscores the value of open standards amid today’s complex data architectures. As organizations try to bridge disparate data systems and make the most of the best data tools available to them, interoperability is more than a convenience; it’s an imperative. Iceberg’s model of transparent management and diversity of contributors helps customers regain control over their data assets. It’s an approach the industry should follow throughout the data stack, ensuring a more open, interoperable, and customer-centric data ecosystem.

James Malone is head of data storage and data engineering at Snowflake.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.