
Do a multithreaded walk when fingerprinting directories #67

Open · wants to merge 5 commits into master
Conversation

@jwodder (Member) commented Feb 9, 2022

Closes #66.

@jwodder added the "patch" label (Increment the patch version when merged) on Feb 9, 2022
@codecov-commenter commented Feb 9, 2022

Codecov Report

Merging #67 (074f49b) into master (3b199d6) will increase coverage by 1.04%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
+ Coverage   90.86%   91.91%   +1.04%     
==========================================
  Files           4        6       +2     
  Lines         438      532      +94     
  Branches       56       71      +15     
==========================================
+ Hits          398      489      +91     
  Misses         24       24              
- Partials       16       19       +3     
Impacted Files Coverage Δ
src/fscacher/fastio.py 89.28% <89.28%> (ø)
src/fscacher/util.py 91.83% <91.83%> (ø)
src/fscacher/cache.py 85.10% <100.00%> (+0.64%) ⬆️
src/fscacher/tests/test_cache.py 94.06% <100.00%> (+0.19%) ⬆️
src/fscacher/tests/test_util.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3b199d6...074f49b.

@jwodder (Member Author) commented Feb 9, 2022

@yarikoptic All of the benchmarks got worse.

@yarikoptic (Member)

Interesting. Maybe the overhead makes it not worthwhile whenever the tree isn't heavy enough? Could you do timing on a sample Zarr? I'm also curious what the CPU utilization would be in both the old and the threaded versions.

@satra commented Feb 9, 2022

@jwodder - interesting. What kind of file tree did you run the benchmarks on? And did you sweep through a range of thread counts? Even for 260k files on a parallel filesystem, 20-30 threads were optimal.

@jwodder (Member Author) commented Feb 10, 2022

@yarikoptic @satra The benchmarks measure fingerprinting of two different directory trees: one containing 100 files, the other containing 3 directories, each containing 3 directories, each containing 3 directories, each containing 3 files (81 files in 39 directories). I only used the default thread count of 60.

What dimensions for a directory tree do you recommend?

@satra commented Feb 10, 2022

Thank you @jwodder - for that kind of tree, I think the thread overhead is large. I suspect your benchmarks will change if you drop the default thread count to less than 5.

Regarding benchmarking (not testing), I think file trees should scale up to perhaps a million items for now, in factors of 10 starting at 100. The other dimension, tree depth, can perhaps go to 8.

Since this is a generic tool, not specific to any particular format or file organization, a few specific trees may be sufficient: say, a 100-item tree at a shallow depth and a 100k-item tree at a larger depth.

I would suggest making up some random tree. A benchmark evaluation matrix could be: tree depth, items, item spread, threads - but for now I think it's really about a small and a large setting to see if there is a difference.

An additional note: you can use the walk to first count the items before caching, and then set the number of threads roughly based on the number of items and the cores available on the machine. I know Python limited the default number of threads in concurrent.futures to roughly min(32, number of cores) in one of the recent updates (3.9, I think); I don't know the provenance of that decision.
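For reference, ThreadPoolExecutor's default max_workers has been min(32, os.cpu_count() + 4) since Python 3.8. A minimal sketch of the pre-walk idea described above, assuming a made-up choose_thread_count() rule of thumb and a hypothetical fingerprint_tree() entry point rather than anything fscacher actually defines:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def choose_thread_count(n_files, files_per_thread=100, cap=32):
    # Hypothetical rule of thumb: roughly one thread per 100 files,
    # bounded by the core count and a hard cap (the numbers are made up).
    cores = os.cpu_count() or 1
    return max(1, min(n_files // files_per_thread + 1, cores, cap))

def fingerprint_tree(path, fingerprint_one):
    # Cheap pre-walk: listing paths is much faster than checksumming them.
    paths = [
        os.path.join(root, name)
        for root, _, files in os.walk(path)
        for name in files
    ]
    workers = choose_thread_count(len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fingerprint_one, paths))
```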

@satra commented Feb 10, 2022

Just noticed that these benchmarks are running on GitHub; can you control the number of cores, or is it essentially queueing up threads?

@jwodder (Member Author) commented Feb 10, 2022

@satra

> Regarding benchmarking (not testing), I think file trees should scale up to perhaps a million items for now, in factors of 10 starting at 100. The other dimension, tree depth, can perhaps go to 8.

Could you describe that as a sequence of integers, where each integer is the number of directories (or files, for the last integer) beneath each directory from the previous level?

> you can use the walk to first count the items before caching, and then set the number of threads roughly based on the number of items and the cores available on the machine

I'm not clear on what you're suggesting here. How is it supposed to set the number of threads after walking?

> Just noticed that these benchmarks are running on GitHub; can you control the number of cores

No.

@satra commented Feb 10, 2022

> Could you describe that as a sequence of integers, where each integer is the number of directories (or files, for the last integer) beneath each directory from the previous level?

Since this doesn't have to be an exact science, I would say something like:

directories_per_level = [...]
files_per_level = [...]

and split those files evenly across the directories. You can decide whether the directories or files should be top-heavy or bottom-heavy.
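As a rough sketch of what a fixture built from those two lists might look like (not code from this PR; build_tree and the example numbers below are made up for illustration):

```python
import os

def build_tree(root, directories_per_level, files_per_level, file_size=1024):
    # directories_per_level[i]: subdirectories created under every directory
    # at depth i; files_per_level[i]: files split evenly across the
    # directories at depth i.
    os.makedirs(root, exist_ok=True)
    levels = [[root]]
    for n_dirs in directories_per_level:
        next_level = []
        for parent in levels[-1]:
            for d in range(n_dirs):
                sub = os.path.join(parent, f"dir{d}")
                os.makedirs(sub, exist_ok=True)
                next_level.append(sub)
        levels.append(next_level)
    for depth, n_files in enumerate(files_per_level):
        dirs = levels[min(depth, len(levels) - 1)]
        for i in range(n_files):
            target = os.path.join(dirs[i % len(dirs)], f"file{i}.dat")
            with open(target, "wb") as fp:
                fp.write(os.urandom(file_size))

# Made-up example settings: one small, shallow tree and one larger, deeper one.
build_tree("bench_small", directories_per_level=[3, 3], files_per_level=[0, 10, 90])
build_tree("bench_large", directories_per_level=[10, 10, 10], files_per_level=[0, 0, 1000, 9000])
```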

> I'm not clear on what you're suggesting here. How is it supposed to set the number of threads after walking?

Some empirical function that you create based on benchmarking on multicore machines (so not GitHub Actions; you could use dandihub as a resource), since a walk is significantly faster than checksumming for large trees. This would also depend on the checksum being used, as that itself could be multithreaded (e.g., blake*, dandi-etag).

@jwodder (Member Author) commented Feb 10, 2022

@satra I want a specific example of a Zarr-like directory tree so that I have something to test on that resembles what this library is going to be used for.

Regarding your second point, are you saying that each benchmark should have its number of threads set to a hardcoded, fine-tuned number? That wouldn't reflect real-world usage of this library.

@satra commented Feb 10, 2022

> I want a specific example of a Zarr-like directory tree

You can check out the trees on dandi hub: /shared/ngffdata/test[64,128]

However, this library gets used in DataLad as well, which is much more general-purpose, and thus as a library it also has to handle more general use cases, I think.

> Regarding your second point, are you saying that each benchmark should have its number of threads set to a hardcoded, fine-tuned number? That wouldn't reflect real-world usage of this library.

The benchmark itself should evaluate across a set of thread counts: 1, 5, 10, 20, 50. Based on the benchmark evaluation, a ruleset could be built to support decision making.
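The existing benchmarks (cache.TimeFlatDirectory, cache.TimeDeepDirectory) look like asv timing classes, so a thread-count sweep could use asv's params mechanism. A minimal sketch under that assumption; the class, the tree it builds, and the inline md5 fingerprinting are made up for illustration and are not fscacher's actual benchmark code:

```python
# Hypothetical asv benchmark sweeping thread counts (illustration only).
import hashlib
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


class TimeThreadedFingerprint:
    # asv runs time_* once for each value in params.
    params = [1, 5, 10, 20, 50]
    param_names = ["threads"]

    def setup(self, threads):
        # Build a small synthetic tree to fingerprint.
        self.tmpdir = tempfile.mkdtemp()
        self.paths = []
        for d in range(10):
            sub = os.path.join(self.tmpdir, f"dir{d}")
            os.makedirs(sub)
            for f in range(10):
                p = os.path.join(sub, f"file{f}.dat")
                with open(p, "wb") as fp:
                    fp.write(os.urandom(1024))
                self.paths.append(p)

    def teardown(self, threads):
        shutil.rmtree(self.tmpdir, ignore_errors=True)

    def time_fingerprint(self, threads):
        def digest(path):
            with open(path, "rb") as fp:
                return hashlib.md5(fp.read()).hexdigest()

        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(digest, self.paths))
```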

@jwodder (Member Author) commented Feb 10, 2022

@satra

> However, this library gets used in DataLad as well

No, it doesn't. Only dandi uses fscacher.

@jwodder (Member Author) commented Feb 22, 2022

After merging in #71, this PR turns out to be slower than not using threads, both in the benchmarks (q.v.) and when running on a tree of 37480 files:

fscacher branch                    Cache Miss   Cache Hit
PR #67 w/xor_bytes, 5 threads      64.4942      0.600201
PR #67 w/xor_bytes, 10 threads     62.9912      0.530866
PR #67 w/xor_bytes, 20 threads     63.5175      0.558002
PR #67 w/xor_bytes, 32 threads     65.1746      0.543029
PR #67 w/xor_bytes, 60 threads     62.39        0.549617
PR #71                             63.8336      0.326017

@yarikoptic (Member)

That is odd. I have yet to look at it in detail (e.g. run py-spy on it to see where it spends time), but I was already surprised that, for example, this block https://github.com/con/fscacher/pull/67/files#diff-d98f0230cd2e7e909d05c5c7a9cc6b2a6441b4b75303080ffec12edb3723ed21R74 is not covered by tests -- so there is no test with subdirectories (which I think should exercise that code path). Also, maybe sorted() could be avoided? ;-)

@jwodder (Member Author) commented Feb 22, 2022

@yarikoptic Correct, there were no tests with subdirectories, though there were benchmarks with subdirectories. I've added a test to cover that spot. Also, sorted() has now been removed.

@jwodder (Member Author) commented Feb 22, 2022

@yarikoptic The benchmark workflow is currently failing because it's running out of space on the VFAT file system while creating sample directories. I don't know why this would be happening now, as I didn't touch that code.

@yarikoptic (Member)

@jwodder -- please rewrite the last commit and repush to see if the "out of space" error persists. If it does, compare with the behavior on master; if it doesn't happen there, it must be due to the changes here. Bisect or analyze what led to it and how it could be mitigated.

@jwodder (Member Author) commented Mar 28, 2022

@yarikoptic I fixed the "out of space" problem, and now the benchmarks are failing because the new code is just slower.

@yarikoptic (Member)

Eh, indeed, up to 6 times slower:
     before           after         ratio
     [3b199d64]       [4f40212d]
     <bm/merge-target>       <bm/pr>
+     2.62±0.03ms         11.6±3ms     4.44  cache.TimeDeepDirectory.time_cache('hit', '.')
+      5.56±0.3ms       22.9±0.4ms     4.12  cache.TimeDeepDirectory.time_cache('hit', '/tmp/nfs')
+     1.94±0.04ms         6.75±2ms     3.47  cache.TimeDeepDirectory.time_cache('hit', '/tmp/vfat')
          653±3μs          673±4μs     1.03  cache.TimeDeepDirectory.time_cache('ignore', '.')
       8.20±0.4ms       7.93±0.1ms     0.97  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/nfs')
         921±10μs          925±3μs     1.00  cache.TimeDeepDirectory.time_cache('ignore', '/tmp/vfat')
+      33.6±0.1ms          204±5ms     6.06  cache.TimeDeepDirectory.time_cache('populate', '.')
+        73.4±2ms          298±9ms     4.05  cache.TimeDeepDirectory.time_cache('populate', '/tmp/nfs')
+      38.3±0.7ms          208±9ms     5.42  cache.TimeDeepDirectory.time_cache('populate', '/tmp/vfat')
          658±9μs         672±20μs     1.02  cache.TimeFile.time_cache('hit')
      10.2±0.03ms      10.2±0.06ms     1.00  cache.TimeFile.time_cache('ignore')
       10.8±0.1ms      10.8±0.06ms     1.00  cache.TimeFile.time_cache('populate')
+     1.62±0.02ms       6.66±0.4ms     4.11  cache.TimeFlatDirectory.time_cache('hit', '.')
+     1.79±0.02ms       7.22±0.5ms     4.03  cache.TimeFlatDirectory.time_cache('hit', '/tmp/nfs')
+     1.45±0.01ms       6.44±0.2ms     4.45  cache.TimeFlatDirectory.time_cache('hit', '/tmp/vfat')
          329±1μs          339±3μs     1.03  cache.TimeFlatDirectory.time_cache('ignore', '.')
         568±20μs         552±10μs     0.97  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/nfs')
          394±2μs         411±10μs     1.04  cache.TimeFlatDirectory.time_cache('ignore', '/tmp/vfat')
+     2.56±0.02ms       8.26±0.3ms     3.23  cache.TimeFlatDirectory.time_cache('populate', '.')
+      4.37±0.1ms       10.0±0.4ms     2.29  cache.TimeFlatDirectory.time_cache('populate', '/tmp/nfs')
+     3.34±0.07ms       8.07±0.1ms     2.42  cache.TimeFlatDirectory.time_cache('populate', '/tmp/vfat')

Which is odd... since the timings are in milliseconds, it is unlikely to be "threading setup" overhead, and it might be worth:

  • analyzing (e.g. with py-spy) what actually consumes time while walking some sizeable tree
  • comparing locally on a more realistic, larger use case (adjusting the tree parameters so it takes seconds, and then doing the comparison as in the previous item)

Labels
performance (Improve performance of an existing feature)
Projects
None yet
Development
Successfully merging this pull request may close these issues: multithread traversal of directories
4 participants