[common] A FileIO API to list files iteratively #4834
default Pair<FileStatus[], String> listFilesPaged(
        Path path, boolean recursive, long pageSize, @Nullable String continuationToken)
        throws IOException {
    FileStatus[] all = listFiles(path, recursive);
#4791 says: "As a consequence callers of FileIO, e.g. ObjectRefresh, can only choose to load the entire catalog of files into memory, which may lead to poor performance and OOM."
With this PR applied, the default implementation still calls listFiles and keeps all of its results in memory, so as I understand it, this cannot solve the OOM problem. What's your opinion?
The purpose of this PR is to agree on the proposed paged-list API, along with a functionally correct default implementation. I intend to submit tailored implementations for the various stores in separate PRs.
ok
The unit test failed due to a bug described in #4849, which is unrelated to this PR.
e25be40 to 6f78884
default Pair<FileStatus[], String> listFilesPaged(
        Path path, boolean recursive, long pageSize, @Nullable String continuationToken)
        throws IOException {
    FileStatus[] all = listFiles(path, recursive);
Maybe we can throw UnsupportedException from this method by default, because this default implementation looks like it has side effects.
Solid point: someone may stumble on it and unknowingly fire a full listFiles call for each page.
I'm removing the default implementation of listFilesPaged.
Also adding supportsListFilesPaged for testing the availability of listFilesPaged.
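The thread above settles on a paged method with no re-listing default, guarded by an availability check. A minimal sketch of how a caller might drain such an API follows. PagedFileIO, Page, InMemoryPagedFileIO and PagedListingSketch are hypothetical stand-ins rather than the actual FileIO types, entries are plain strings instead of FileStatus, and the token format (next start index) is purely an assumption of this toy store.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical page type: entries plus a continuation token (null on the last page). */
final class Page {
    final List<String> files;
    final String continuationToken;

    Page(List<String> files, String continuationToken) {
        this.files = files;
        this.continuationToken = continuationToken;
    }
}

/** Hypothetical stand-in for the paged part of FileIO; not the real interface. */
interface PagedFileIO {
    /** Mirrors the hinting method mentioned in the thread. */
    default boolean supportsListFilesPaged() {
        return false;
    }

    /** Refuses by default instead of silently running a full listing per page. */
    default Page listFilesPaged(String path, long pageSize, String continuationToken)
            throws IOException {
        throw new UnsupportedOperationException("listFilesPaged is not supported");
    }
}

/** Toy store paging over an in-memory list; the token is the next start index. */
final class InMemoryPagedFileIO implements PagedFileIO {
    private final List<String> files;

    InMemoryPagedFileIO(List<String> files) {
        this.files = files;
    }

    @Override
    public boolean supportsListFilesPaged() {
        return true;
    }

    @Override
    public Page listFilesPaged(String path, long pageSize, String continuationToken) {
        int start = continuationToken == null ? 0 : Integer.parseInt(continuationToken);
        int end = (int) Math.min(files.size(), start + pageSize);
        String next = end < files.size() ? String.valueOf(end) : null;
        return new Page(new ArrayList<>(files.subList(start, end)), next);
    }
}

final class PagedListingSketch {
    /** Fetches one page per call and follows the token until it becomes null. */
    static List<String> listAll(PagedFileIO io, String path, long pageSize) throws IOException {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        if (!io.supportsListFilesPaged()) {
            throw new UnsupportedOperationException("store cannot list files page by page");
        }
        List<String> result = new ArrayList<>();
        String token = null;
        do {
            Page page = io.listFilesPaged(path, pageSize, token);
            result.addAll(page.files);
            token = page.continuationToken;
        } while (token != null);
        return result;
    }
}
```

A real caller would process each page and discard it rather than accumulate everything, since accumulating brings back the memory problem the paging is meant to avoid.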
a925e75 to ae91470
 * page. The continuation token will be <code>null</code> if the returned page is the last
 * page.
 */
default Pair<FileStatus[], String> listFilesPaged(
Would a listFileIterator work for you?
In that case, I'm thinking of reworking the definition of this API to:
default Iterator<FileStatus> listFilesIterative(Path path, boolean recursive) {
// a default implementation that stores all files in the returned Iterator
// implementors may return Iterators that internally maintain pages
// the preferred page size is to be determined when FileIO is configure()-ed
}
listFiles will be preserved as a handy helper to unpack the result of listFilesIterative.
The hinting method supportsListFilesPaged will be removed. (Maybe propose something like FileIO#hasFeature in the future, but that deserves a separate PR.)
What's your take?
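The relationship between the two methods can be sketched roughly as below. IterativeFileIO and FixedFileIO are hypothetical names, entries are plain strings instead of FileStatus, and listFilesIterative is left abstract here only to keep the sketch free of circular defaults; the comment above proposes a default for it, which this sketch does not reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for FileIO; not the real interface. */
interface IterativeFileIO {
    /** Implementors may return an iterator that fetches pages internally. */
    Iterator<String> listFilesIterative(String path, boolean recursive);

    /** Helper that unpacks the whole iterator into an array. */
    default String[] listFiles(String path, boolean recursive) {
        List<String> all = new ArrayList<>();
        Iterator<String> it = listFilesIterative(path, recursive);
        while (it.hasNext()) {
            all.add(it.next());
        }
        return all.toArray(new String[0]);
    }
}

/** Toy implementation backed by a fixed list of names. */
final class FixedFileIO implements IterativeFileIO {
    private final List<String> files;

    FixedFileIO(String... files) {
        this.files = Arrays.asList(files);
    }

    @Override
    public Iterator<String> listFilesIterative(String path, boolean recursive) {
        return files.iterator();
    }
}
```

With this split, callers that can stream use the iterator directly, while existing callers of listFiles keep the array-returning behavior.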
@Override
public FileStatus next() {
    return files.remove();
Maybe this can be done lazily with an Iterator?
We would only actually call listStatus for subdirectories inside hasNext or next.
Adding a FileStatusIterator interface so that hasNext and next can throw IOException during lazy listing.
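The lazy approach discussed here might look roughly like the following. This is only a sketch against the local file system through java.nio.file, not the PR's code: IOIterator and LazyFileIterator are hypothetical names, and Path stands in for FileStatus. Each directory is opened only when the iterator runs out of buffered files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

/** Hypothetical iterator whose methods may throw IOException, unlike java.util.Iterator. */
interface IOIterator<E> {
    boolean hasNext() throws IOException;

    E next() throws IOException;
}

/** Lists regular files under a root, opening each directory only when it is reached. */
final class LazyFileIterator implements IOIterator<Path> {
    private final Deque<Path> pendingDirs = new ArrayDeque<>();
    private final Deque<Path> readyFiles = new ArrayDeque<>();

    LazyFileIterator(Path root) {
        pendingDirs.push(root);
    }

    @Override
    public boolean hasNext() throws IOException {
        // Expand one directory at a time until a file is buffered or nothing is left.
        while (readyFiles.isEmpty() && !pendingDirs.isEmpty()) {
            Path dir = pendingDirs.pop();
            try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
                for (Path child : children) {
                    if (Files.isDirectory(child)) {
                        pendingDirs.push(child);
                    } else {
                        readyFiles.add(child);
                    }
                }
            }
        }
        return !readyFiles.isEmpty();
    }

    @Override
    public Path next() throws IOException {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return readyFiles.remove();
    }
}
```

Memory use is bounded by the contents of one directory plus the queue of unvisited directories, rather than by the whole tree.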
5cbd2c5 to e616407
import java.io.IOException;

/** An iterator for lazily listing {@link FileStatus}. */
public interface FileStatusIterator extends Closeable {
Why not just add a class like org.apache.hadoop.fs.RemoteIterator?
Generalizing FileStatusIterator into the generic o.a.p.fs.RemoteIterator<E>.
 *
 * @throws IOException - if failed to close the iterator
 */
void close() throws IOException;
Do we really need this method?
IMHO, a future implementation may rely on contextual resources, so it is better to enforce the resource paradigm from day one.
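To illustrate what the resource paradigm buys callers, here is a small sketch using try-with-resources. CloseableFileIterator, TrackingIterator and CloseSketch are hypothetical names, and strings stand in for FileStatus. Because the iterator type is Closeable from the start, callers written this way keep working if a later implementation begins holding connections or handles.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical closeable iterator, loosely modeled on the interface under review. */
interface CloseableFileIterator extends Closeable {
    boolean hasNext() throws IOException;

    String next() throws IOException;
}

/** Toy implementation that records whether close() ran. */
final class TrackingIterator implements CloseableFileIterator {
    private final Iterator<String> delegate;
    boolean closed = false;

    TrackingIterator(List<String> files) {
        this.delegate = files.iterator();
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        return delegate.next();
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class CloseSketch {
    /** Counts entries; the iterator is closed even if hasNext or next throws. */
    static int countFiles(CloseableFileIterator files) throws IOException {
        int count = 0;
        try (CloseableFileIterator it = files) {
            while (it.hasNext()) {
                it.next();
                count++;
            }
        }
        return count;
    }
}
```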
+1
Purpose
Linked issue: close #4791
Proposing FileIO#listFilesPaged and its non-iterative sibling FileIO#listFiles.
Tests
FileIOTest#testListFiles
FileIOTest#testListFilesPaged
API and Format
Adding listFilesPaged method to FileIO interface.
Documentation
As described in the methods' JavaDocs.