
Do a multithreaded walk when fingerprinting directories #67

Open · wants to merge 5 commits into master
Conversation

@jwodder (Member) commented Feb 9, 2022

Closes #66.

@jwodder added the "patch" label (Increment the patch version when merged) on Feb 9, 2022
@codecov-commenter commented Feb 9, 2022

Codecov Report

Merging #67 (074f49b) into master (3b199d6) will increase coverage by 1.04%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
+ Coverage   90.86%   91.91%   +1.04%     
==========================================
  Files           4        6       +2     
  Lines         438      532      +94     
  Branches       56       71      +15     
==========================================
+ Hits          398      489      +91     
  Misses         24       24              
- Partials       16       19       +3     
Impacted Files Coverage Δ
src/fscacher/fastio.py 89.28% <89.28%> (ø)
src/fscacher/util.py 91.83% <91.83%> (ø)
src/fscacher/cache.py 85.10% <100.00%> (+0.64%) ⬆️
src/fscacher/tests/test_cache.py 94.06% <100.00%> (+0.19%) ⬆️
src/fscacher/tests/test_util.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3b199d6...074f49b.

@jwodder (Member Author) commented Feb 9, 2022

@yarikoptic All of the benchmarks got worse.

@yarikoptic (Member)

Interesting. Maybe the overhead makes it not worthwhile whenever the tree isn't heavy enough? Could you do timing on a sample Zarr? I'm also curious what the CPU utilization would be in both the old and the threaded versions.

@satra commented Feb 9, 2022

@jwodder - interesting. What kind of file tree did you run the benchmarks on? And did you sweep through a range of thread counts? Even for 260k files on a parallel filesystem, 20-30 threads were optimal.

@jwodder (Member Author) commented Feb 10, 2022

@yarikoptic @satra The benchmarks measure fingerprinting of two different directory trees: one containing 100 files, the other containing 3 directories, each containing 3 directories, each containing 3 directories, each containing 3 files (81 files in 39 directories). I only used the default thread count of 60.

What dimensions for a directory tree do you recommend?

@satra commented Feb 10, 2022

Thank you @jwodder - for that kind of tree, I think the thread overhead is large. I suspect your benchmarks will change if you drop the default thread count to less than 5.

Regarding benchmarking (not testing), I think file trees should scale up to perhaps a million items for now, in factors of 10 starting at 100. The other dimension, tree depth, can perhaps go to 8.

Since this is a generic tool, not specific to any particular format or file organization, a few specific trees may be sufficient: say, a 100-item tree at a shallow depth and a 100k-item tree at a larger depth.

I would suggest making up some random tree. A benchmark evaluation matrix could be: tree depth, items, item spread, threads - but for now I think it's really about a small and a large setting to see if there is a difference.

An additional note: you can use the walk to first count the items before caching, and then set the number of threads roughly based on the number of items and the cores available on the machine. I know Python limited the default number of threads in concurrent.futures to roughly min(32, number of cores) in one of the recent updates (3.9, I think); I don't know the provenance of that decision.
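For reference, ThreadPoolExecutor's default max_workers has been min(32, os.cpu_count() + 4) since Python 3.8. A minimal sketch of the pre-walk idea described above, assuming a made-up choose_thread_count() rule of thumb and a hypothetical fingerprint_tree() entry point rather than anything fscacher actually defines:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def choose_thread_count(n_files, files_per_thread=100, cap=32):
    # Hypothetical rule of thumb: roughly one thread per 100 files,
    # bounded by the core count and a hard cap (the numbers are made up).
    cores = os.cpu_count() or 1
    return max(1, min(n_files // files_per_thread + 1, cores, cap))

def fingerprint_tree(path, fingerprint_one):
    # Cheap pre-walk: listing paths is much faster than checksumming them.
    paths = [
        os.path.join(root, name)
        for root, _, files in os.walk(path)
        for name in files
    ]
    workers = choose_thread_count(len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fingerprint_one, paths))
```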

@satra commented Feb 10, 2022

Just noticed that these benchmarks are running on GitHub; can you control the number of cores, or is it essentially queueing up threads?

@jwodder (Member Author) commented Feb 10, 2022

@satra

> Regarding benchmarking (not testing), I think file trees should scale up to perhaps a million items for now, in factors of 10 starting at 100. The other dimension, tree depth, can perhaps go to 8.

Could you describe that as a sequence of integers, where each integer is the number of directories (or files, for the last integer) beneath each directory from the previous level?

> you can use the walk to first count the items before caching, and then set the number of threads roughly based on the number of items and the cores available on the machine

I'm not clear on what you're suggesting here. How is it supposed to set the number of threads after walking?

> Just noticed that these benchmarks are running on GitHub; can you control the number of cores

No.

@satra commented Feb 10, 2022

> Could you describe that as a sequence of integers, where each integer is the number of directories (or files, for the last integer) beneath each directory from the previous level?

Since this doesn't have to be an exact science, I would say something like:

directories_per_level = [...]
files_per_level = [...]

and split those files evenly across the directories. You can decide whether the directories or files should be top-heavy or bottom-heavy.
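As a rough sketch of what a fixture built from those two lists might look like (not code from this PR; build_tree and the example numbers below are made up for illustration):

```python
import os

def build_tree(root, directories_per_level, files_per_level, file_size=1024):
    # directories_per_level[i]: subdirectories created under every directory
    # at depth i; files_per_level[i]: files split evenly across the
    # directories at depth i.
    os.makedirs(root, exist_ok=True)
    levels = [[root]]
    for n_dirs in directories_per_level:
        next_level = []
        for parent in levels[-1]:
            for d in range(n_dirs):
                sub = os.path.join(parent, f"dir{d}")
                os.makedirs(sub, exist_ok=True)
                next_level.append(sub)
        levels.append(next_level)
    for depth, n_files in enumerate(files_per_level):
        dirs = levels[min(depth, len(levels) - 1)]
        for i in range(n_files):
            target = os.path.join(dirs[i % len(dirs)], f"file{i}.dat")
            with open(target, "wb") as fp:
                fp.write(os.urandom(file_size))

# Made-up example settings: one small, shallow tree and one larger, deeper one.
build_tree("bench_small", directories_per_level=[3, 3], files_per_level=[0, 10, 90])
build_tree("bench_large", directories_per_level=[10, 10, 10], files_per_level=[0, 0, 1000, 9000])
```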

> I'm not clear on what you're suggesting here. How is it supposed to set the number of threads after walking?

Some empirical function that you create based on benchmarking on multicore machines (so not GitHub Actions; you could use dandihub as a resource), since a walk is significantly faster than checksumming for large trees. This would also depend on the checksum being used, as that itself could be multithreaded (e.g., blake*, dandi-etag).

@jwodder (Member Author) commented Feb 10, 2022

@satra I want a specific example of a Zarr-like directory tree so that I have something to test on that resembles what this library is going to be used for.

Regarding your second point, are you saying that each benchmark should have its number of threads set to a hardcoded, fine-tuned number? That wouldn't reflect real-world usage of this library.

@satra commented Feb 10, 2022

> I want a specific example of a Zarr-like directory tree

You can check out the trees on dandi hub: /shared/ngffdata/test[64,128]

However, this library gets used in DataLad as well, which is much more general-purpose, and thus as a library it also has to handle more general use cases, I think.

> Regarding your second point, are you saying that each benchmark should have its number of threads set to a hardcoded, fine-tuned number? That wouldn't reflect real-world usage of this library.

The benchmark itself should evaluate across a set of thread counts: 1, 5, 10, 20, 50. Based on the benchmark evaluation, a ruleset could be built to support decision making.
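The existing benchmarks (cache.TimeFlatDirectory, cache.TimeDeepDirectory) look like asv timing classes, so a thread-count sweep could use asv's params mechanism. A minimal sketch under that assumption; the class, the tree it builds, and the inline md5 fingerprinting are made up for illustration and are not fscacher's actual benchmark code:

```python
# Hypothetical asv benchmark sweeping thread counts (illustration only).
import hashlib
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


class TimeThreadedFingerprint:
    # asv runs time_* once for each value in params.
    params = [1, 5, 10, 20, 50]
    param_names = ["threads"]

    def setup(self, threads):
        # Build a small synthetic tree to fingerprint.
        self.tmpdir = tempfile.mkdtemp()
        self.paths = []
        for d in range(10):
            sub = os.path.join(self.tmpdir, f"dir{d}")
            os.makedirs(sub)
            for f in range(10):
                p = os.path.join(sub, f"file{f}.dat")
                with open(p, "wb") as fp:
                    fp.write(os.urandom(1024))
                self.paths.append(p)

    def teardown(self, threads):
        shutil.rmtree(self.tmpdir, ignore_errors=True)

    def time_fingerprint(self, threads):
        def digest(path):
            with open(path, "rb") as fp:
                return hashlib.md5(fp.read()).hexdigest()

        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(digest, self.paths))
```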

@jwodder (Member Author) commented Feb 10, 2022

@satra

> However, this library gets used in DataLad as well

No, it doesn't. Only dandi uses fscacher.

@jwodder (Member Author) commented Feb 22, 2022

After merging in #71, this PR turns out to be slower than not using threads, both in the benchmarks (q.v.) and when running on a tree of 37480 files:

fscacher branch                    Cache Miss   Cache Hit
PR #67 w/xor_bytes, 5 threads      64.4942      0.600201
PR #67 w/xor_bytes, 10 threads     62.9912      0.530866
PR #67 w/xor_bytes, 20 threads     63.5175      0.558002
PR #67 w/xor_bytes, 32 threads     65.1746      0.543029
PR #67 w/xor_bytes, 60 threads     62.39        0.549617
PR #71                             63.8336      0.326017

@yarikoptic (Member)

That is odd. I have yet to look at it in detail (e.g. run py-spy on it to see where it spends time), but I was already surprised that, for example, this block https://github.com/con/fscacher/pull/67/files#diff-d98f0230cd2e7e909d05c5c7a9cc6b2a6441b4b75303080ffec12edb3723ed21R74 is not covered by tests -- so there is no test with subdirectories (which I think should exercise that code path). Also, maybe sorted() could be avoided? ;-)

@jwodder (Member Author) commented Feb 22, 2022

@yarikoptic Correct, there were no tests with subdirectories, though there were benchmarks with subdirectories. I've added a test to cover that spot. Also, sorted() has now been removed.

@jwodder (Member Author) commented Feb 22, 2022

@yarikoptic The benchmark workflow is currently failing because it's running out of space on the VFAT file system while creating sample directories. I don't know why this would be happening now, as I didn't touch that code.

@yarikoptic (Member)

@jwodder -- please rewrite the last commit and repush to see if the "out of space" error persists. If it does, compare with the behavior on master; if it doesn't happen there, it must be due to the changes here. Bisect or analyze what led to it and how it could be mitigated.

@jwodder (Member Author) commented Mar 28, 2022

@yarikoptic I fixed the "out of space" problem, and now the benchmarks are failing because the new code is just slower.

@yarikoptic (Member)

Eh, indeed, up to 6 times slower:
     before           after         ratio
     [3b199d64]       [4f40212d]
     <bm/merge-target>       <bm/pr>
+     2.62±0.03ms         11.6±3ms     4.44  cache.TimeDeepDirectory.time_cache('hit', '.')
+      5.56±0.3ms       22.9±0.4ms     4.12  cache.TimeDeepDirectory.time_cache('hit', '/tmp/nfs')
+     1.94±0.04ms         6.75±2ms     3.47  cache.TimeDeepDirectory.time_cache('hit', '/tmp/vfat')
          653±3μs          673±4μs     1.03  cache.TimeDeepDirectory.time_cache('ignore', '.')
       8.20±0.4ms       7.93±0.1ms     0.97  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/nfs')
         921±10μs          925±3μs     1.00  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/vfat')
+      33.6±0.1ms          204±5ms     6.06  cache.TimeDeepDirectory.time_cache('populate', '.')
+        73.4±2ms          298±9ms     4.05  cache.TimeDeepDirectory.time_cache('populate', '/tmp/nfs')
+      38.3±0.7ms          208±9ms     5.42  cache.TimeDeepDirectory.time_cache('populate', '/tmp/vfat')
          658±9μs         672±20μs     1.02  cache.TimeFile.time_cache('hit')
      10.2±0.03ms      10.2±0.06ms     1.00  cache.TimeFile.time_cache('ignore')
       10.8±0.1ms      10.8±0.06ms     1.00  cache.TimeFile.time_cache('populate')
+     1.62±0.02ms       6.66±0.4ms     4.11  cache.TimeFlatDirectory.time_cache('hit', '.')
+     1.79±0.02ms       7.22±0.5ms     4.03  cache.TimeFlatDirectory.time_cache('hit', '/tmp/nfs')
+     1.45±0.01ms       6.44±0.2ms     4.45  cache.TimeFlatDirectory.time_cache('hit', '/tmp/vfat')
          329±1μs          339±3μs     1.03  cache.TimeFlatDirectory.time_cache('ignore', '.')
         568±20μs         552±10μs     0.97  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/nfs')
          394±2μs         411±10μs     1.04  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/vfat')
+     2.56±0.02ms       8.26±0.3ms     3.23  cache.TimeFlatDirectory.time_cache('populate', '.')
+      4.37±0.1ms       10.0±0.4ms     2.29  cache.TimeFlatDirectory.time_cache('populate', '/tmp/nfs')
+     3.34±0.07ms       8.07±0.1ms     2.42  cache.TimeFlatDirectory.time_cache('populate', '/tmp/vfat')

Which is odd... since the timings are in milliseconds, it is unlikely to be "threading setup" overhead, and it might be worth:

  • analyzing (e.g. with py-spy) what actually consumes time while walking some sizeable tree
  • comparing locally on a more realistic, larger use case (adjusting the tree parameters so it takes seconds, and then doing the comparison as in the previous item)

Labels
performance (Improve performance of an existing feature)
Projects
None yet
Development
Successfully merging this pull request may close these issues: multithread traversal of directories
4 participants