Understanding the Future of Apache Iceberg Catalogs

Java, REST and the Future of Open Composable Data Lakehouses

Apr 05, 2024

Apache Iceberg is revolutionizing the data industry as an open-source table format that allows data lake storage layers to function as full-fledged data warehouses, a concept known as a data lakehouse. This transformation has led to the development of comprehensive data lakehouse platforms and lakehouse management tools, creating a robust ecosystem for modular data warehousing. At the heart of these lakehouse systems is the catalog, which tracks tables so that various tools can identify and interact with them efficiently.

What is an Apache Iceberg Catalog

In a recent article, I explored the workings of Apache Iceberg catalogs and the existing catalogs within the ecosystem. Essentially, at the most basic level, Apache Iceberg catalogs maintain a list of tables, each linked to the location of its latest "metadata.json" file.

The Status Quo of Catalog Production

Initially, Apache Iceberg's API was predominantly Java-based, utilizing a Catalog class. Each catalog implementation would inherit from this class to ensure compatibility within the Apache Iceberg Java ecosystem. While effective, this approach faced challenges as the Iceberg API expanded into other languages like Python, Go, and Rust. Each catalog implementation had to be rewritten for these languages, often leading to inconsistencies.

Moreover, at the tooling level, the reliance on different Java classes for each catalog meant that tools had to develop and test support for each catalog individually. This reliance on client-side code interactions slowed down the adoption of new catalog implementations by tools.

The REST Catalog

To address this issue, the concept of the "REST Catalog" was introduced. Its goal is to provide a single, language-agnostic interface that allows tools to interact with any catalog in any language, eliminating the need for custom connectors. In the current iteration of the REST Catalog specification, there are numerous endpoints for standard catalog operations, such as retrieving and updating metadata references, among other functions. The objective is for all catalogs to eventually achieve compatibility with this interface, enabling tools to interact seamlessly with any Iceberg catalog, whether it's an in-house, vendor, or open-source implementation.

The Future of the REST Catalog

Recently, a new proposal has been put forward to initiate discussions on the next iteration of the REST catalog, aiming to shift many operations from the client side to the server side. This change would enable catalog implementations to optimize Apache Iceberg operations across various tools and become more adaptable. It would accommodate better extensibility enabling unique features like Nessie's Git-like catalog versioning, among other distinctive capabilities that different implementations might wish to introduce. Once this new iteration is refined and adopted, it will not only establish a standardized interface for all catalogs but also create an environment where these catalogs can innovate within that standard framework, experiment, and introduce a range of new advantages to Apache Iceberg.

Conclusion

Apache Iceberg's table specification and catalog interface are continually evolving to support a broad, open ecosystem for constructing modular and composable data systems. There are compelling reasons to adopt Apache Iceberg as the foundation of your data lakehouse. If you haven't yet experienced building an Apache Iceberg lakehouse, I recommend reading this article for a practical exercise you can perform on your laptop.

Open Data Lakehouses for Everyone

Discussion about this post