Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[common] A FileIO API to list files iteratively #4834

Merged
merged 7 commits into from
Jan 17, 2025

Conversation

smdsbz
Copy link
Contributor

@smdsbz smdsbz commented Jan 3, 2025

Purpose

Linked issue: close #4791

Proposing FileIO#listFilesPaged and its non-iterative sibling FileIO#listFiles .

Tests

  • FileIOTest#testListFiles
  • FileIOTest#testListFilesPaged

API and Format

Adding listFilesPaged method to FileIO interface.

Documentation

As described in the methods' JavaDocs.

@smdsbz smdsbz marked this pull request as ready for review January 3, 2025 10:01
default Pair<FileStatus[], String> listFilesPaged(
Path path, boolean recursive, long pageSize, @Nullable String continuationToken)
throws IOException {
FileStatus[] all = listFiles(path, recursive);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #4791, it says that :"As a consequence callers of FileIO, e.g. ObjectRefresh, can only choose to load the entire catalog of files into memory, which may lead to poor performance and OOM."

If apply this pr, here call listFiles, It still keep all the results of the listFiles in memory, and I understand that it cannot solve the OOM problem. What's your opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this PR is to agree on the proposed paged-list API, along with a functionally-correct default implementation. I intend to submit tailored implementations of various stores in separate PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@smdsbz
Copy link
Contributor Author

smdsbz commented Jan 7, 2025

UT failed due to a bug described in #4849, which is irrelevant to this PR.

@smdsbz smdsbz force-pushed the paged-list branch 2 times, most recently from e25be40 to 6f78884 Compare January 13, 2025 07:27
default Pair<FileStatus[], String> listFilesPaged(
Path path, boolean recursive, long pageSize, @Nullable String continuationToken)
throws IOException {
FileStatus[] all = listFiles(path, recursive);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we can throw UnsupportedException for this method by default.

Because, it looks like this default implementation has side effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid point, someone may stumble on it and unconsciously firing full-fat listFileses for each page.

I'm erasing the default impl for listFilesPaged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also adding supportsListFilesPaged for testing the availability of listFilesPaged.

@smdsbz smdsbz force-pushed the paged-list branch 2 times, most recently from a925e75 to ae91470 Compare January 14, 2025 05:04
* page. The continuation token will be <code>null</code> if the returned page is the last
* page.
*/
default Pair<FileStatus[], String> listFilesPaged(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think a listFileIterator is OK for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I'm thinking on reworking the definition of this API to

default Iterator<FileStatus> listFilesIterative(Path path, boolean recursive) {
  // a default implementation that stores all files in the returned Iterator

  // implementors may return Iterators that internally maintain pages
  // the preferred page size is to be determined when FileIO is configure()-ed
}

listFiles will be preserved as a handy helper to unpack the result of listFilesIterative.

The hinting method supportsListFilesPaged will be removed. (Maybe propose something like FileIO#hasFeature in the future, but that deserves a separate PR.)

What's your take?


@Override
public FileStatus next() {
return files.remove();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be lazily using Iterator?
In hasNext or next, we do really listStatus for sub directories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a FileStatusIterator interface to allow hasNext and next throwing IOException while lazy listing.

@smdsbz smdsbz force-pushed the paged-list branch 2 times, most recently from 5cbd2c5 to e616407 Compare January 15, 2025 04:55
import java.io.IOException;

/** An iterator for lazily listing {@link FileStatus}. */
public interface FileStatusIterator extends Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add a class like org.apache.hadoop.fs.RemoteIterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Templating FileStatusIterator to o.a.p.fs.RemoteIterator<E>

*
* @throws IOException - if failed to close the iterator
*/
void close() throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we must have this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO It is possible that a future implementation would rely on contextual resources. It's better we enforce the resource paradigm since day one.

@JingsongLi
Copy link
Contributor

+1

@JingsongLi JingsongLi merged commit b152608 into apache:master Jan 17, 2025
12 checks passed
@smdsbz smdsbz deleted the paged-list branch January 17, 2025 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] A FileIO API to list files iteratively
3 participants