Do a multithreaded walk when fingerprinting directories #67
base: master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
+ Coverage   90.86%   91.91%   +1.04%
==========================================
  Files           4        6       +2
  Lines         438      532      +94
  Branches       56       71      +15
==========================================
+ Hits          398      489      +91
  Misses         24       24
- Partials       16       19       +3
Continue to review full report at Codecov.
@yarikoptic All of the benchmarks got worse.
Interesting. Maybe the overhead makes it not worthwhile whenever the tree isn't heavy enough? Could you do timing on a sample Zarr? Also curious what the CPU utilization would be in both the old and threaded versions.
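For context on the overhead question, here is a minimal sketch of the general pattern a threaded fingerprinting walk follows: collect the file list, then fan the per-file work out over a thread pool. This is not fscacher's actual implementation; all names and the default of 60 workers are illustrative only.

```python
# Hypothetical sketch, not fscacher's code: fingerprint a directory tree by
# stat()-ing every file from a thread pool and folding the results into one digest.
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def _fingerprint_file(path):
    """Cheap per-file fingerprint based on stat() metadata."""
    st = os.stat(path)
    return f"{path}:{st.st_size}:{st.st_mtime_ns}"


def fingerprint_tree(root, max_workers=60):
    """Combine per-file fingerprints into a single digest for the whole tree."""
    files = [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ]
    digest = hashlib.md5()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map() yields results in input order, so the digest is deterministic
        for fp in pool.map(_fingerprint_file, files):
            digest.update(fp.encode())
    return digest.hexdigest()
```

Since each task here is little more than a stat() call, the cost of submitting tasks and collecting results can easily exceed the work itself on a local filesystem with a small tree; threads mainly help when each task spends real time waiting on I/O, e.g. on a network filesystem.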
@jwodder - interesting. what kind of file tree did you run the benchmarks on? and did you sweep through a range of thread counts? even for 260k files on a parallel filesystem, 20-30 threads were optimal.
@yarikoptic @satra The benchmarks measure fingerprinting of two different directory trees, one containing 100 files, the other containing 3 directories which each contain 3 directories which each contain 3 directories which each contain 3 files. I only used the default thread value of 60. What dimensions for a directory tree do you recommend?
thank you @jwodder - for that kind of tree i think the thread overhead is large. i suspect your benchmarks will change if you drop the default thread count to less than 5.

regarding benchmarking (not testing), i think file trees should scale perhaps up to a million items for now, perhaps in factors of 10 starting at 100. the other dimension of tree depth can perhaps go to 8. since this is a generic tool, and not specific to any particular format or file organization, it may be sufficient to work with, say, a 100-item tree at a shallow depth and a 100k-item tree at a larger depth. i would suggest making up some random tree. a benchmark evaluation matrix could be: tree depth, items, item-spread, threads - but for now i think it's really about a small and a large setting to see if there is a difference.

an additional note: you can use the walk to first evaluate the number of items before caching, and then set the number of threads roughly based on the number of items and the cores available on the machine. i know python limited the number of threads to min(32, numcores) in one of the recent updates (3.9 i think) to concurrent futures. i don't know the provenance of that decision.
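A sketch of the "walk first, then pick a thread count" idea from the comment above. The cutoff below is an arbitrary placeholder rather than a tuned value, and the helper names are made up; for reference, the actual `concurrent.futures` default since Python 3.8 is `max_workers = min(32, os.cpu_count() + 4)`.

```python
# Hypothetical heuristic, not part of fscacher: count items with a cheap walk,
# then size the thread pool from the item count and the available cores.
import os


def count_items(root):
    """First pass: count files under root (much cheaper than checksumming)."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))


def choose_workers(n_items, small_tree_cutoff=1000):
    """Pick a worker count; the cutoff is an untuned placeholder."""
    if n_items < small_tree_cutoff:
        return 1  # thread overhead likely outweighs any gain on small trees
    cores = os.cpu_count() or 1
    # cap like concurrent.futures' default, and never exceed the number of files
    return min(32, cores + 4, n_items)
```

Usage would be along the lines of `fingerprint_tree(root, max_workers=choose_workers(count_items(root)))` with the walker sketched earlier; whether the extra counting pass pays for itself would itself need benchmarking.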
just noticed that these benchmarks are running on github, can you control the number of cores or is it essentially queueing up threads?
Could you describe that as a sequence of integers, where each integer is the number of directories (or files, for the last integer) beneath each directory from the previous level?
I'm not clear on what you're suggesting here. How is it supposed to set the number of threads after walking?
No.
since this doesn't have to be exact science, i would say something like: directories_per_level = [...] and split those files evenly across the directories. you can decide if the directories or files should be top heavy or bottom heavy.
some empirical function that you create based on benchmarking on multicore machines (so not github actions; you could use dandihub as a resource). since a walk is significantly faster than checksumming for large trees, the extra counting pass is cheap. also this would depend on the checksum being used, as that itself could be multithreaded (e.g., blake*, dandi-etag).
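A minimal sketch of the kind of synthetic-tree generator described in the comment above: a `directories_per_level` fan-out spec plus a total file count, with the files split evenly across the leaf directories. The file names and payload size are arbitrary choices for illustration, not anything fscacher prescribes.

```python
# Hypothetical benchmark helper, not part of fscacher: build a synthetic tree
# from a per-level directory fan-out and a total number of small files.
import os


def make_tree(root, directories_per_level, n_files, payload=b"x" * 1024):
    # expand the fan-out spec into concrete leaf directories
    leaves = [root]
    for fanout in directories_per_level:
        leaves = [
            os.path.join(parent, f"d{i}")
            for parent in leaves
            for i in range(fanout)
        ]
    for leaf in leaves:
        os.makedirs(leaf, exist_ok=True)
    # split the files evenly across the leaves (remainder goes to the first leaves)
    per_leaf, extra = divmod(n_files, len(leaves))
    for j, leaf in enumerate(leaves):
        count = per_leaf + (1 if j < extra else 0)
        for k in range(count):
            with open(os.path.join(leaf, f"f{k}.dat"), "wb") as f:
                f.write(payload)


# e.g. make_tree("bench_tree", directories_per_level=[10, 10], n_files=100_000)
```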
@satra I want a specific example of a Zarr-like directory tree so that I have something to test on that resembles what this library is going to be used for. Regarding your second point, are you saying that each benchmark should have its number of threads set to a hardcoded, fine-tuned number? That wouldn't reflect real-world usage of this library.
you can check out the trees on dandi hub. however, this library gets used in datalad as well, which is much more general purpose and thus as a library also has to handle more general use cases i think.
the benchmark itself should evaluate across a set of threads:
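A hedged sketch of such a thread-sweep benchmark, written against asv's parametrization conventions (the `time_cache` results later in this thread follow asv's naming). It drives the hypothetical `make_tree` and `fingerprint_tree` helpers sketched in earlier comments, not fscacher's actual API.

```python
# Hypothetical asv-style benchmark sweeping thread counts.
import shutil
import tempfile

# stand-in helpers from the sketches above; module name is made up
from bench_helpers import fingerprint_tree, make_tree


class TimeThreadedFingerprint:
    # thread counts to sweep; asv runs each time_* method once per value
    params = [1, 2, 5, 10, 20, 30, 60]
    param_names = ["threads"]

    def setup(self, threads):
        self.tree = tempfile.mkdtemp()
        make_tree(self.tree, directories_per_level=[3, 3, 3], n_files=81)

    def time_fingerprint(self, threads):
        fingerprint_tree(self.tree, max_workers=threads)

    def teardown(self, threads):
        shutil.rmtree(self.tree)
```

Scaling `n_files` and `directories_per_level` up toward the 100k-item range mentioned above is what would show whether larger thread counts ever win.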
No, it doesn't. Only dandi uses fscacher.
Force-pushed from f9fb001 to 7ec27c1.
After merging in #71, this PR turns out to be slower than not using threads, both in the benchmarks (q.v.) and when running on a tree of 37480 files:
that is odd. yet to look in detail (e.g. run py-spy on it to see where it spends time etc), but already got surprised that e.g. this https://github.com/con/fscacher/pull/67/files#diff-d98f0230cd2e7e909d05c5c7a9cc6b2a6441b4b75303080ffec12edb3723ed21R74 block is not test covered -- so there is no test which has subdirectories (where I think that should exercise that code path). also may be
@yarikoptic Correct, there were no tests with subdirectories, though there were benchmarks with subdirectories. I've added a test to cover that spot. Also,
@yarikoptic The benchmark workflow is currently failing because it's running out of space on the VFAT file system while creating sample directories. I don't know why this would be happening now, as I didn't touch that code.
@jwodder -- please rewrite the last commit and repush to see if the "out of space" error persists. If it does -- compare with behavior on master -- if it doesn't exhibit there, it must be due to changes here. Bisect or analyze what led to that and how it could be mitigated.
@yarikoptic I fixed the "out of space" problem, and now the benchmarks are failing because the new code is just slower.
eh, indeed up to 6 times slower

       before           after         ratio
     [3b199d64]       [4f40212d]
  <bm/merge-target>     <bm/pr>
+     2.62±0.03ms      11.6±3ms     4.44  cache.TimeDeepDirectory.time_cache('hit', '.')
+     5.56±0.3ms       22.9±0.4ms   4.12  cache.TimeDeepDirectory.time_cache('hit', '/tmp/nfs')
+     1.94±0.04ms      6.75±2ms     3.47  cache.TimeDeepDirectory.time_cache('hit', '/tmp/vfat')
       653±3μs          673±4μs     1.03  cache.TimeDeepDirectory.time_cache('ignore', '.')
      8.20±0.4ms       7.93±0.1ms   0.97  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/nfs')
       921±10μs         925±3μs     1.00  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/vfat')
+     33.6±0.1ms        204±5ms     6.06  cache.TimeDeepDirectory.time_cache('populate', '.')
+     73.4±2ms          298±9ms     4.05  cache.TimeDeepDirectory.time_cache('populate', '/tmp/nfs')
+     38.3±0.7ms        208±9ms     5.42  cache.TimeDeepDirectory.time_cache('populate', '/tmp/vfat')
       658±9μs          672±20μs    1.02  cache.TimeFile.time_cache('hit')
      10.2±0.03ms      10.2±0.06ms  1.00  cache.TimeFile.time_cache('ignore')
      10.8±0.1ms       10.8±0.06ms  1.00  cache.TimeFile.time_cache('populate')
+     1.62±0.02ms      6.66±0.4ms   4.11  cache.TimeFlatDirectory.time_cache('hit', '.')
+     1.79±0.02ms      7.22±0.5ms   4.03  cache.TimeFlatDirectory.time_cache('hit', '/tmp/nfs')
+     1.45±0.01ms      6.44±0.2ms   4.45  cache.TimeFlatDirectory.time_cache('hit', '/tmp/vfat')
       329±1μs          339±3μs     1.03  cache.TimeFlatDirectory.time_cache('ignore', '.')
       568±20μs         552±10μs    0.97  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/nfs')
       394±2μs          411±10μs    1.04  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/vfat')
+     2.56±0.02ms      8.26±0.3ms   3.23  cache.TimeFlatDirectory.time_cache('populate', '.')
+     4.37±0.1ms       10.0±0.4ms   2.29  cache.TimeFlatDirectory.time_cache('populate', '/tmp/nfs')
+     3.34±0.07ms      8.07±0.1ms   2.42  cache.TimeFlatDirectory.time_cache('populate', '/tmp/vfat')

which is odd... since timings in
Closes #66.