
Enabling Presto Iceberg to leverage the powerful capabilities of object storages #24383

Open
hantangwangd opened this issue Jan 17, 2025 · 1 comment
Labels
feature request iceberg Apache Iceberg related

Comments

hantangwangd (Member) commented Jan 17, 2025

Object stores are becoming more and more common compared with HDFS: they offer higher scalability, better performance, and native support for cloud services. Enabling Presto Iceberg to leverage these capabilities of object stores would therefore be a valuable feature.

As discussed with @tdcmeehan in PR #24221, there may be issues with transaction atomicity and consistency when using HadoopCatalog to directly manage metadata on object stores. Although many people are trying to solve this problem (see: https://lists.apache.org/thread/kh4n98w4z22sc8h2vot4q8n44vdtnltg), especially after the emergence of S3 conditional writes, the current reality is that risks remain.

However, referring to the Iceberg community's discussion of the capabilities and limitations of HadoopCatalog (see: https://lists.apache.org/thread/oohcjfp1vpo005h2r0f6gfpsp6op0qps and https://lists.apache.org/thread/v7x65kxrrozwlvsgstobm7685541lf5w), we know that HadoopCatalog is capable of managing metadata on HDFS, thanks to HDFS's strict support for atomic, non-overwriting rename operations. Moreover, some Iceberg users already use HadoopCatalog to maintain metadata files on HDFS while storing data files on object stores, see: https://lists.apache.org/thread/rkg1cnmcl102o8g9ko5l0o152jzgpglm. This approach therefore has no transaction consistency issues in theory, has been validated in production environments, and still lets us leverage object stores.

We can expand the capabilities of our Iceberg Hadoop catalog to achieve the following:

    1. Support setting an independent data write path for Iceberg tables, which is also a native capability of the Iceberg library. This way, when creating a table, we can specify an independent location for its actual data, such as a path on S3.
    2. Add a configuration property for the Iceberg Hadoop catalog, such as iceberg.catalog.warehouse.datadir, which represents the default data write root directory for newly created tables in the entire catalog. If this value is configured, all newly created tables will derive their data write path from this root directory by default, unless a data write path is explicitly specified in the table creation statement.
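To illustrate point 1, a table-level override might look like the sketch below. The property name and syntax are illustrative only (Iceberg itself has a `write.data.path` table property, but how it would be exposed in Presto's `CREATE TABLE ... WITH (...)` clause is not decided in this issue):

```sql
-- Sketch only: the property name below is illustrative, not an agreed API.
-- Metadata stays under the catalog warehouse (e.g. HDFS), while data files
-- for this table are written to an explicitly specified object store path.
CREATE TABLE iceberg.tpch.orders (
    orderkey    BIGINT,
    orderstatus VARCHAR
)
WITH (
    "write.data.path" = 's3://my-bucket/warehouse-data/tpch/orders'
);
```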

With these changes, in a production environment we can configure "iceberg.catalog.warehouse" as a locally deployed HDFS path and "iceberg.catalog.warehouse.datadir" as an S3 path, to safely utilize the powerful storage capabilities of object stores.
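A catalog configuration sketch under these assumptions (the datadir property is the one proposed in this issue and does not exist yet; host names and bucket are placeholders):

```properties
# etc/catalog/iceberg.properties (sketch)
connector.name=iceberg
iceberg.catalog.type=hadoop
# metadata managed by HadoopCatalog on HDFS, where rename is atomic
iceberg.catalog.warehouse=hdfs://namenode:8020/warehouse
# proposed property: default data write root for newly created tables
iceberg.catalog.warehouse.datadir=s3://my-bucket/warehouse-data
```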

Test plan: Build an object storage environment based on a MinIO Docker container, configure iceberg.catalog.warehouse as a local file path and iceberg.catalog.warehouse.datadir as an S3 path, then fully run the tests in IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries.
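For reference, a local MinIO instance for such a test environment can be started along these lines (ports and credentials are illustrative):

```shell
# Sketch of the S3-compatible test backend; values are placeholders.
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minio \
  -e MINIO_ROOT_PASSWORD=minio123 \
  minio/minio server /data --console-address ":9001"
```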

I have already completed the verification in my local workspace, and this is not a big change. Any thoughts or concerns would be greatly appreciated. @tdcmeehan @ZacBlanco @imjalpreet @agrawalreetika @kiersten-stokes

tdcmeehan (Contributor) commented:

This sounds like a reasonable approach to me.
