Performance: Use file indexer when scanning with file source #3333
Description
This PR alters file_source.go to use a new file indexer, rather than the existing directory indexer.
Currently, when scanning a non-archive file, file_source.go applies a filter function to the directory indexer such that all files other than the file being scanned and its parent directory are ignored by the directory indexer. See here.
This approach becomes problematic when the scanned file is inside a directory with a large number of files, for two reasons:
This pprof profile shows heap allocation when scanning a file within a directory containing a large number of files. I'm including it here as evidence for my root cause analysis.
Walking all of the files in the containing directory is redundant when using a file source, since as mentioned above the filter function will ignore everything other than the scanned file and its parent dir.
In this change, I have added a new file indexer which should match the existing behaviour of the directory indexer for a single file source. However, instead of walking the file system, it simply makes an attempt to index the containing directory and the file target.
I have also added file.go to satisfy the resolver interface when using the file indexer. Much of the functionality matches that of directory.go, and I would welcome any suggestions for improvement here, as I appreciate there's a bit of duplicated code.
The existing directory.go has many unit tests to verify behaviour in the event that the directory being walked contains symlinks etc. I have attempted to simplify the unit tests for file.go, as it does not have to handle all of the complexity that directory.go does, but I would really appreciate extra review attention in this area, as I may not be aware of all the ways a target for file_source may be defined.
I haven't got a pprof diagram for the new approach, but memstat profiling has shown O(1) heap use wrt the number of files in the containing directory when using file source, as expected.
Additionally, creating a resolver via a file_source now takes O(1) time wrt the number of files in the containing directory.
Type of change
Checklist: