Remove Dependency on Hadoop's Filesystem Class from Remove Orphan Files #11541
Comments
#7914 should address this; it'd be great if someone could take this forward.
@amogh-jahagirdar I don't think #7914 is a good approach to addressing this, as it's not scalable. It trades distributed listing for a single iteration. To support prefix listing, we need to break up the key space in a way that lets us parallelize the work. I'll comment on the PR; there's a better approach that we're looking into.
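(Not from the thread: to make the scalability point concrete, here is a minimal sketch of what breaking up the key space might look like, assuming the caller can derive sub-prefixes, e.g. one per top-level directory under the table location. The class and method names are illustrative only; on Spark, the per-prefix work would be shipped to executors rather than run on driver threads as shown here.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

import org.apache.iceberg.io.FileInfo;
import org.apache.iceberg.io.SupportsPrefixOperations;

public class ParallelPrefixListing {

  // List several sub-prefixes concurrently and merge the results.
  // Splitting the key space into sub-prefixes is the part that makes
  // prefix listing parallelizable instead of a single iteration.
  public static List<String> listInParallel(
      SupportsPrefixOperations io, List<String> subPrefixes) {
    List<CompletableFuture<List<String>>> futures =
        subPrefixes.stream()
            .map(prefix -> CompletableFuture.supplyAsync(() -> listPrefix(io, prefix)))
            .collect(Collectors.toList());
    return futures.stream()
        .flatMap(future -> future.join().stream())
        .collect(Collectors.toList());
  }

  // Collect all file locations under a single prefix.
  private static List<String> listPrefix(SupportsPrefixOperations io, String prefix) {
    List<String> locations = new ArrayList<>();
    for (FileInfo file : io.listPrefix(prefix)) {
      locations.add(file.location());
    }
    return locations;
  }
}
```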
I'm happy to take a look, independently and/or with community input. I'll circle back in the coming days with any progress.
I have a few questions to help get started.
Is the idea to branch in DeleteOrphanFilesSparkAction or in some other more general location(s)?
Also, where should new tests go? I see there are existing tests for this action.
I think you could put new tests there, but this may be a bit difficult to actually test, since the listing is a side effect within the action. You'll probably need to instrument the two branches (FileIO vs. the current Hadoop implementation) and then check which one is called.
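(A minimal sketch, not an existing Iceberg class and not from the thread, of one way to instrument the FileIO branch: wrap a prefix-capable FileIO so a test can assert that `listPrefix` was actually invoked by the action.)

```java
import org.apache.iceberg.io.FileInfo;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.SupportsPrefixOperations;

// Hypothetical test helper: delegates all calls to a real prefix-capable
// FileIO and records whether the prefix-listing branch was exercised.
public class RecordingPrefixFileIO implements SupportsPrefixOperations {
  private final SupportsPrefixOperations delegate;
  private volatile boolean prefixListingUsed = false;

  public RecordingPrefixFileIO(SupportsPrefixOperations delegate) {
    this.delegate = delegate;
  }

  // A test can assert on this after running the action.
  public boolean prefixListingUsed() {
    return prefixListingUsed;
  }

  @Override
  public Iterable<FileInfo> listPrefix(String prefix) {
    prefixListingUsed = true; // record that the FileIO branch was taken
    return delegate.listPrefix(prefix);
  }

  @Override
  public void deletePrefix(String prefix) {
    delegate.deletePrefix(prefix);
  }

  @Override
  public InputFile newInputFile(String path) {
    return delegate.newInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return delegate.newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    delegate.deleteFile(path);
  }
}
```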
Just DeleteOrphanFiles; I don't believe any other location uses these classes. The base actions are Hadoop-agnostic.
The idea is that in places like https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L311 we would do a check:

```java
if (table.io() instanceof SupportsPrefixOperations) {
  // listDirRecursivelyFileIO(location, predicate, ...)
} else {
  // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
  // less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
  listDirRecursively( // probably rename this function to listDirRecursivelyHadoop
      location,
      predicate,
      hadoopConf.value(),
      MAX_DRIVER_LISTING_DEPTH,
      MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
      subDirs,
      pathFilter,
      matchingFiles);
}
```

I haven't checked the rest of the code, but basically anywhere you see org.apache.hadoop classes, we should check whether the table's FileIO supports prefix operations and, if it does, branch and use those instead.
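(For what it's worth, a rough sketch of what the proposed `listDirRecursivelyFileIO` might look like; the signature below is an assumption based on the snippet above, not the eventual implementation. With SupportsPrefixOperations, a single `listPrefix` call walks the entire key space under the location, so no recursion, depth limits, or Hadoop FileSystem classes are needed.)

```java
import java.util.List;
import java.util.function.Predicate;

import org.apache.iceberg.io.FileInfo;
import org.apache.iceberg.io.SupportsPrefixOperations;

// Sketch only: parameter shapes are assumptions.
private static void listDirRecursivelyFileIO(
    String location,
    Predicate<FileInfo> predicate,   // e.g. last-modified cutoff
    Predicate<String> pathFilter,    // e.g. skip hidden directories
    SupportsPrefixOperations io,
    List<String> matchingFiles) {
  // One flat listing replaces the recursive Hadoop directory walk.
  for (FileInfo file : io.listPrefix(location)) {
    if (predicate.test(file) && pathFilter.test(file.location())) {
      matchingFiles.add(file.location());
    }
  }
}
```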
Feature Request / Improvement
Currently, RemoveOrphanFiles is hardwired to use Hadoop FS classes when listing directories to locate orphan files. This is a problem for anyone working with a REST catalog, since only the catalog's IO (Iceberg's FileIO) will have credentials; the Hadoop conf will not.
Luckily, I think we already have everything we need in SupportsPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop classes.
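(For context, the interface in org.apache.iceberg.io is roughly the following; paraphrased, so check the source for exact signatures:)

```java
public interface SupportsPrefixOperations extends FileIO {
  // List all files whose locations start with the given prefix,
  // regardless of directory depth.
  Iterable<FileInfo> listPrefix(String prefix);

  // Delete all files whose locations start with the given prefix.
  void deletePrefix(String prefix);
}
```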
I propose we branch whenever a Hadoop class is about to be used and use FileIO instead if it supports the proper operations. If it does not, we should fall back to the Hadoop classes.
Query engine
Spark
Willingness to contribute