multithread digest'ing of zarr folders #913

Closed
yarikoptic opened this issue Feb 16, 2022 · 12 comments · Fixed by #923
@yarikoptic (Member) commented Feb 16, 2022

ATM, if I run dandi digest on a hot folder (checksummed before, so IO is fast), dandi digest keeps the CPU only ~30% busy and takes over 20 seconds, whereas a parallelized checksumming example takes about 20× less time and goes above 100% CPU.

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:51:44,202 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185120Z-4990.log

real    0m24.034s
user    0m8.524s
sys     0m5.120s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:52:24,406 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185202Z-5127.log

real    0m22.499s
user    0m8.369s
sys     0m6.190s

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time python /shared/io-utils/fastio_md5.py test64.ngff/0/0/0/0
Total: 7200

real    0m1.358s
user    0m1.545s
sys     0m2.164s

Related PR introducing a multithreaded walk in fscacher (the benefit is not yet 100% clear): con/fscacher#67
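The fastio_md5.py script itself is not included in this issue, but the general approach it represents — walking a directory tree and digesting files from a thread pool — can be sketched as follows. This is a hypothetical illustration; `digest_tree` and its signature are not the actual script's API.

```python
# Hypothetical sketch of multithreaded per-file MD5 digesting, in the spirit
# of the fastio_md5.py reference script (whose actual code is not shown here).
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def md5_file(path, chunk_size=1 << 20):
    """Digest one file in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()


def digest_tree(root, workers=5):
    """Return {path: md5_hexdigest} for every file under root."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, files in os.walk(root)
        for name in files
    ]
    # Threads work well here because hashing releases the GIL during I/O
    # and hashlib updates on large buffers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_file, paths))
```

On a warm page cache, this pattern is what lets a digester exceed 100% CPU, since several files are read and hashed concurrently.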

@satra (Member) commented Feb 16, 2022

benefit is not yet 100% clear

isn't 20x less a benefit :)

@yarikoptic (Member Author)

Ha ha, we don't see that 20x yet in fscacher -- we might be adding too much startup overhead or just being too Pythonic ;-) yet to profile, etc.

@jwodder (Member) commented Feb 17, 2022

I've created a script at https://github.com/jwodder/zarr-digest-timings for timing different Zarr checksumming methods with different caching configurations. In my initial tests (using large directory trees of small files), just using fscacher (both threaded & non-threaded) has had the worst effect on performance. Please try out the script on any decent sample Zarrs you have.

@jwodder (Member) commented Feb 18, 2022

@yarikoptic @satra Following up on the above: My attempts to do the benchmarking on the Hub were thwarted because my session kept disconnecting, so I spun up a $10/month DigitalOcean droplet and used that. From timing on a directory tree of random data with a layout matching/based on test128.ngff (37480 files, five directory levels deep, with random sizes in the six digits), using five threads (the number that concurrent.futures would use as the default max_workers on the droplet), I got the following times:

| | sync | fastio | oothreads | trio | recursive |
| --- | --- | --- | --- | --- | --- |
| No Caching | 171.871 | 61.8478 | 67.2708 | 101.9 | 154.549 |
| Caching Files | 184.005 / 15.9847 | 94.5188 / 14.863 | 98.3243 / 16.5179 | 179.106 / 15.7787 | |
| Caching Directories (No Threads) | 159.681 / 10.9414 | 74.77 / 11.4389 | 80.3404 / 11.9911 | 114.492 / 11.0248 | 191.263 / 11.3294 |
| Caching Directories (Threads) | 155.546 / 11.5107 | 73.7246 / 11.1678 | 83.5654 / 13.6311 | 115.265 / 11.9566 | 191.582 / 11.5187 |
| Caching Both (No Threads) | 190.551 / 11.3605 | 110.705 / 11.221 | 122.877 / 11.8501 | 223.814 / 10.9694 | |
| Caching Both (Threads) | 195.624 / 11.7084 | 114.535 / 11.6021 | 119.309 / 12.6332 | 230.242 / 11.5458 | |

(The "Caching Files" and "Caching Both" rows have only four values each in the source; they are shown here in source order, leaving the last column blank.)

(Where caching was involved, two times are shown; the first is the runtime of the initial cache-populating call, and the second is the average runtime of subsequent calls.)

Observations:

  • The fastest checksumming implementation is the one based on the fastio/threaded-walk code.
  • When Zarr checksumming is cached per directory, retrieving an already-cached value (for the directory tree in question) always takes about 11 seconds.
  • Whether fscacher is threaded or not doesn't seem to make an appreciable difference.
  • When checksumming a directory for the first time with an empty cache, caching file digests almost always slows things down.

I also tried timing with 20 threads, but that ended up being a bit slower.
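For context on why per-directory caching works at all: a recursive checksummer derives each directory's digest from the digests of its entries, so the digest of an unchanged subtree can be served straight from cache. The sketch below mirrors the shape of a recursive variant like those being benchmarked; it is NOT dandi's actual zarr-checksum algorithm, just an illustration of the structure that makes directory-level caching possible.

```python
# Hedged sketch of recursive per-directory checksumming of a Zarr tree.
# The exact digest format is illustrative, not dandi's zarr-checksum scheme.
import hashlib
import os


def digest_directory(dirpath):
    """Digest a directory from its entries' digests (bottom-up)."""
    h = hashlib.md5()
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isdir(path):
            # This call is the natural caching point: if the subtree is
            # unchanged, a cached value can be returned instead.
            entry_digest = digest_directory(path)
        else:
            with open(path, "rb") as f:
                entry_digest = hashlib.md5(f.read()).hexdigest()
        h.update(f"{name}\0{entry_digest}\n".encode())
    return h.hexdigest()
```

Because the top-level digest depends only on child digests, a fully warm cache reduces the work to stat-ing directories and combining cached values, which is consistent with the flat ~11-second cached reads in the table above.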

@yarikoptic (Member Author)

If you still have the instance available, could you please also provide the timing for that /shared/io-utils/fastio_md5.py (present on the hub)? It would give a reference timing that I think we should strive to achieve (without caching).

@jwodder (Member) commented Feb 18, 2022

@yarikoptic That's what the "fastio" column is (except it digests files piecemeal and also calculates a final Zarr checksum).

(But if you really want the exact time for that exact script on my sample tree on the droplet, it's 59.799s.)

@yarikoptic (Member Author)

great -- thank you!

@jwodder (Member) commented Feb 21, 2022

@yarikoptic Question: What's the plan for dealing with the fact that the upload code currently calculates the digest for a Zarr twice? The first time, the digest is used to determine whether to upload the Zarr asset at all (based on --existing) and to populate the digest field in the asset metadata; the second time, it's used to check that the digest recorded in the metadata is still the same and that the server reports the correct value after upload. I recall that this was discussed, but I don't recall a conclusion, and no issue was created. Implementing this issue and #914 would be simpler if we could drop the first digestion.

@yarikoptic (Member Author)

Well, #915 (it was not assigned) was my summary, which is in line (I think) with your thinking above -- that the first digesting should be dropped (for now, at least).

@yarikoptic (Member Author)

  • Note from con/fscacher#71 (comment) ("Cache directory fingerprint as a XORed hash of file fingerprints"): timings for cached reads improved from 11 sec to 0.3 sec with optimizations to fscacher.
  • fastio seems to consistently and notably (by up to 10%) outperform the OO version (oothreads), so I would prefer to go with fastio.
  • I am a little disappointed that Caching Directories (in any form) adds too notable an overhead (~20%) to just enable it by default. But maybe it is worth rerunning the benchmarks (at least for fastio) with the current fscacher, to see whether that overhead also shrank with the recent changes?
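The idea behind the con/fscacher#71 optimization referenced above — a directory fingerprint as a XOR of per-file fingerprints — can be sketched as follows. This is an illustration of the XOR-combining technique, not fscacher's actual implementation; the function names and the choice of stat fields are assumptions.

```python
# Illustrative sketch: combine fixed-size per-file fingerprints into a
# directory fingerprint by XOR, so the result is independent of traversal
# order and cheap to update incrementally. Not fscacher's actual code.
import hashlib


def file_fingerprint(path, mtime, size):
    # Hash cheap stat-based metadata rather than file contents.
    data = f"{path}\0{mtime}\0{size}".encode()
    return hashlib.md5(data).digest()


def dir_fingerprint(entries):
    """entries: iterable of (path, mtime, size) triples."""
    acc = bytes(16)  # 16 zero bytes, matching the MD5 digest size
    for path, mtime, size in entries:
        fp = file_fingerprint(path, mtime, size)
        acc = bytes(a ^ b for a, b in zip(acc, fp))
    return acc.hex()
```

Order-independence means threads can contribute fingerprints in whatever order they finish, with no sorting or locking beyond the final combine — plausibly part of why the cached-read time dropped so sharply.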

@jwodder (Member) commented Feb 22, 2022

@yarikoptic

But maybe it is worth rerunning the benchmarks (at least for fastio) with the current fscacher

I already ran those benchmarks; you can see them here, in the rows labelled "PR #71 (xor_bytes)", and the relevant comparisons are here.

@yarikoptic (Member Author)

The zarr64-smallfiles results are a bit confusing, but overall it seems to me that "PR #71, 5 threads" adds not that much overhead over "baseline, 5 threads" (and is somehow at times faster, e.g. the "oothreads" implementation in zarr64-smallfiles), unless I am reading those results wrong.
