
REST Catalog S3 Signer Endpoint should be Catalog specific #11608

Open
1 of 3 tasks
c-thiel opened this issue Nov 20, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

c-thiel (Contributor) commented Nov 20, 2024

Apache Iceberg version

1.7.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Currently, when configuring two REST catalogs in Spark, the s3.signer.uri of the first catalog is also used for the second catalog.

During the initial connection to the REST catalog, the catalog may return an s3.signer.uri attribute as part of the overrides of the /v1/config endpoint. This property appears to be applied globally to the Spark session: whichever catalog I use first, sign requests for the second catalog are sent to the signer endpoint of the first. Using each catalog separately works perfectly fine.

I tested with a single Lakekeeper instance, where we use a different signer endpoint for each warehouse, as well as with two Nessie instances. In my tests the warehouses share the same bucket but use different path prefixes.
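For reference, here is a minimal sketch of how the override is surfaced at catalog load time. The catalog URL and warehouse name are placeholders, and the exact response contents depend on the catalog implementation; the only point is that s3.signer.uri arrives per catalog via /v1/config and should therefore stay scoped to that catalog:

    import requests

    # Hypothetical endpoint and warehouse, for illustration only.
    CATALOG_URL = "http://localhost:8181/catalog"
    WAREHOUSE = "warehouse_1"

    # The REST catalog exposes GET /v1/config; implementations can ship
    # per-catalog settings such as s3.signer.uri in the "overrides" map.
    resp = requests.get(f"{CATALOG_URL}/v1/config", params={"warehouse": WAREHOUSE})
    resp.raise_for_status()
    config = resp.json()

    # With the bug described above, the value returned here by the *first*
    # catalog ends up being used for sign requests of every catalog in the session.
    print(config.get("overrides", {}).get("s3.signer.uri"))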

My Spark configuration looks like this:

    "spark.sql.catalog.catalog1": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.catalog1.type": "rest",
    "spark.sql.catalog.catalog1.uri": CATALOG_1_URL,
    "spark.sql.catalog.catalog1.warehouse": "warehouse_1",
    "spark.sql.catalog.catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.catalog1.s3.remote-signing-enabled": "true",
    "spark.sql.catalog.catalog2": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.catalog2.type": "rest",
    "spark.sql.catalog.catalog2.uri": CATALOG_2_URL,
    "spark.sql.catalog.catalog2.warehouse": "warehouse_2",
    "spark.sql.catalog.catalog2.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.catalog1.s3.remote-signing-enabled": "true",

If required, I can add a Docker Compose example as well.
If someone could point me in the right direction, I might be able to create a fix PR.
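As a rough reproduction, here is a PySpark sketch under a few assumptions: placeholder catalog URLs, a namespace named db in each catalog, and the Iceberg Spark runtime jar on the classpath. It is meant to show the ordering that triggers the behavior, not to be a definitive test case:

    from pyspark.sql import SparkSession

    # Placeholder endpoints for two independent REST catalogs (assumptions).
    CATALOG_1_URL = "http://catalog-1:8181/catalog"
    CATALOG_2_URL = "http://catalog-2:8181/catalog"

    spark = (
        SparkSession.builder.appName("signer-repro")
        .config("spark.sql.catalog.catalog1", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.catalog1.type", "rest")
        .config("spark.sql.catalog.catalog1.uri", CATALOG_1_URL)
        .config("spark.sql.catalog.catalog1.warehouse", "warehouse_1")
        .config("spark.sql.catalog.catalog1.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.catalog1.s3.remote-signing-enabled", "true")
        .config("spark.sql.catalog.catalog2", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.catalog2.type", "rest")
        .config("spark.sql.catalog.catalog2.uri", CATALOG_2_URL)
        .config("spark.sql.catalog.catalog2.warehouse", "warehouse_2")
        .config("spark.sql.catalog.catalog2.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.catalog2.s3.remote-signing-enabled", "true")
        .getOrCreate()
    )

    # Touch catalog1 first so its s3.signer.uri override is loaded.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS catalog1.db")
    spark.sql("CREATE TABLE IF NOT EXISTS catalog1.db.t1 (id BIGINT) USING iceberg")
    spark.sql("INSERT INTO catalog1.db.t1 VALUES (1)")

    # With the bug, the writes below send their S3 sign requests to
    # catalog1's signer endpoint instead of catalog2's and fail.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS catalog2.db")
    spark.sql("CREATE TABLE IF NOT EXISTS catalog2.db.t2 (id BIGINT) USING iceberg")
    spark.sql("INSERT INTO catalog2.db.t2 VALUES (1)")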

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
c-thiel added the bug label on Nov 20, 2024
c-thiel (Contributor, Author) commented Nov 21, 2024

This is not only a problem with Spark; it also affects at least StarRocks.
According to a user on our Discord, they see the same behavior as I describe for Spark above:

I can confirm that both catalogs (lake and lake2) work perfectly fine when set up and used individually in StarRocks. I can create tables, insert data, and query without any issues when only one catalog is active at a time.

However, the problem arises when both catalogs are configured simultaneously. At that point, operations on the second catalog (like INSERT) fail.
