
TableOperations.locationProvider is not respected by Spark #11527

Open
jamesbornholt opened this issue Nov 12, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jamesbornholt

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

I have a custom Catalog implementation that overrides TableOperations.locationProvider, but running DML queries with Spark doesn't invoke that overridden method. Instead, Spark calls LocationProviders.locationsFor (the BaseMetastoreTableOperations implementation of locationProvider) directly:

this.lazyLocationProvider = LocationProviders.locationsFor(location, properties);

It is reached via this stack trace when running something like spark.sql("INSERT INTO ..."):

org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:42)
org.apache.iceberg.SerializableTable.locationProvider(SerializableTable.java:244)
org.apache.iceberg.io.OutputFileFactory$Builder.build(OutputFileFactory.java:177)
org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:681)
org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:668)
org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:441)
org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:430)
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:496)
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:393)

It looks like this may have been broken by #9029 (@przemekd, @aokolnychyi); after reverting that change, my locationProvider is invoked as expected.
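
For context, here is a rough, simplified sketch of the kind of location provider a custom Catalog's TableOperations.locationProvider() can return (the class name and path layout below are made up for illustration):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.LocationProvider;

// Illustrative only: a provider that places data files under a custom prefix
// instead of the default <table location>/data layout.
public class PrefixedLocationProvider implements LocationProvider {
    private final String tableLocation;

    public PrefixedLocationProvider(String tableLocation) {
        this.tableLocation = tableLocation;
    }

    @Override
    public String newDataLocation(String filename) {
        return String.format("%s/custom-data/%s", tableLocation, filename);
    }

    @Override
    public String newDataLocation(PartitionSpec spec, StructLike partitionData, String filename) {
        return String.format(
            "%s/custom-data/%s/%s", tableLocation, spec.partitionToPath(partitionData), filename);
    }
}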

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@jamesbornholt added the bug label on Nov 12, 2024
@przemekd
Contributor

@jamesbornholt I see. It looks like that PR fixed one issue but introduced another. I wasn't aware that you can provide a custom default location provider by implementing a custom Catalog. Still, I'd argue the Spark job shouldn't fail when it only reads data from tables and that custom location provider implementation isn't available.
I can change the code to use:

table.locationProvider()

to get the table's location provider, but instead of letting it possibly fail right away, we could wrap the call in a custom Result class such as:

import java.io.Serializable;

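/**
 * Holds either a successfully computed value or the exception thrown while
 * computing it, surfacing the failure only when the value is actually requested.
 */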
public class Result<T> implements Serializable {
    private static final long serialVersionUID = 1L;

    private final T value;
    private final Exception error;

    private Result(T value, Exception error) {
        this.value = value;
        this.error = error;
    }

    public static <T> Result<T> success(T value) {
        return new Result<>(value, null);
    }

    public static <T> Result<T> failure(Exception error) {
        return new Result<>(null, error);
    }

    public boolean isSuccess() {
        return error == null;
    }

    public T getValue() {
        if (error != null) {
            throw new IllegalStateException("Cannot get value from a failed result", error);
        }
        return value;
    }

    public Exception getError() {
        if (error == null) {
            throw new IllegalStateException("No error present in a successful result");
        }
        return error;
    }
}

and it will only get unpacked when the location provider is truly needed.
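For example, here is a rough sketch of how the capture could look (the helper class below is only illustrative, not the actual SerializableTable change):

import org.apache.iceberg.Table;
import org.apache.iceberg.io.LocationProvider;

// Illustrative sketch only: capture the provider eagerly, but defer any failure
// until a writer actually asks for it, so read-only jobs are unaffected.
public class LocationProviderCapture {

    public static Result<LocationProvider> capture(Table table) {
        try {
            return Result.success(table.locationProvider());
        } catch (Exception e) {
            return Result.failure(e);
        }
    }

    public static LocationProvider require(Result<LocationProvider> captured) {
        // Only write paths call this; a failed Result surfaces its error here.
        return captured.getValue();
    }
}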
Let me know what you think about that @jamesbornholt @aokolnychyi

@jamesbornholt
Author

I agree it’s desirable to allow read queries to succeed without the custom location provider available. I like the idea of just deferring the exception like this!
