multithread digest'ing of zarr folders #913

Closed
yarikoptic opened this issue Feb 16, 2022 · 12 comments · Fixed by #923
@yarikoptic (Member) commented Feb 16, 2022

ATM, if I run dandi digest on a hot folder (checksummed before, so IO is fast), dandi digest keeps the CPU only ~30% busy and takes over 20 seconds, whereas a parallelized checksumming example takes about 20× less time and goes above 100% CPU.

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:51:44,202 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185120Z-4990.log

real    0m24.034s
user    0m8.524s
sys     0m5.120s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time DANDI_CACHE=ignore dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 18:52:24,406 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185202Z-5127.log

real    0m22.499s
user    0m8.369s
sys     0m6.190s

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time python /shared/io-utils/fastio_md5.py test64.ngff/0/0/0/0
Total: 7200

real    0m1.358s
user    0m1.545s
sys     0m2.164s

Related PR introducing a multithreaded walk in fscacher (the benefit is not yet 100% clear): con/fscacher#67
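The fastio_md5.py script itself is not included in this issue, but the general approach it represents — walking a directory tree and digesting files from a thread pool — can be sketched as follows. This is a hypothetical illustration; `digest_tree` and its signature are not the actual script's API.

```python
# Hypothetical sketch of multithreaded per-file MD5 digesting, in the spirit
# of the fastio_md5.py reference script (whose actual code is not shown here).
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def md5_file(path, chunk_size=1 << 20):
    """Digest one file in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()


def digest_tree(root, workers=5):
    """Return {path: md5_hexdigest} for every file under root."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, files in os.walk(root)
        for name in files
    ]
    # Threads work well here because hashing releases the GIL during I/O
    # and hashlib updates on large buffers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_file, paths))
```

On a warm page cache, this pattern is what lets a digester exceed 100% CPU, since several files are read and hashed concurrently.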

@satra (Member) commented Feb 16, 2022

benefit is not yet 100% clear

isn't 20x less a benefit :)

@yarikoptic (Member Author)

Ha ha, we don't see that 20x yet in fscacher -- we might be adding too much startup overhead or just being too Pythonic ;-) yet to profile, etc.

@jwodder (Member) commented Feb 17, 2022

I've created a script at https://github.com/jwodder/zarr-digest-timings for timing different Zarr checksumming methods with different caching configurations. In my initial tests (using large directory trees of small files), just using fscacher (both threaded & non-threaded) has had the worst effect on performance. Please try out the script on any decent sample Zarrs you have.

@jwodder (Member) commented Feb 18, 2022

@yarikoptic @satra Following up on the above: My attempts to do the benchmarking on the Hub were thwarted because my session kept disconnecting, so I spun up a $10/month DigitalOcean droplet and used that. From timing on a directory tree of random data with a layout matching/based on test128.ngff (37480 files, five directory levels deep, with random sizes in the six digits), using five threads (the number that concurrent.futures would use as the default max_workers on the droplet), I got the following times:

| | sync | fastio | oothreads | trio | recursive |
| --- | --- | --- | --- | --- | --- |
| No Caching | 171.871 | 61.8478 | 67.2708 | 101.9 | 154.549 |
| Caching Files | 184.005 / 15.9847 | 94.5188 / 14.863 | 98.3243 / 16.5179 | 179.106 / 15.7787 | |
| Caching Directories (No Threads) | 159.681 / 10.9414 | 74.77 / 11.4389 | 80.3404 / 11.9911 | 114.492 / 11.0248 | 191.263 / 11.3294 |
| Caching Directories (Threads) | 155.546 / 11.5107 | 73.7246 / 11.1678 | 83.5654 / 13.6311 | 115.265 / 11.9566 | 191.582 / 11.5187 |
| Caching Both (No Threads) | 190.551 / 11.3605 | 110.705 / 11.221 | 122.877 / 11.8501 | 223.814 / 10.9694 | |
| Caching Both (Threads) | 195.624 / 11.7084 | 114.535 / 11.6021 | 119.309 / 12.6332 | 230.242 / 11.5458 | |

(The "Caching Files" and "Caching Both" rows have only four values each in the source; they are shown here in source order, leaving the last column blank.)

(Where caching was involved, two times are shown; the first is the runtime of the initial cache-populating call, and the second is the average runtime of subsequent calls.)

Observations:

  • The fastest checksumming implementation is the one based on the fastio/threaded-walk code.
  • When Zarr checksumming is cached per directory, retrieving an already-cached value (for the directory tree in question) always takes about 11 seconds.
  • Whether fscacher is threaded or not doesn't seem to make an appreciable difference.
  • When checksumming a directory for the first time with an empty cache, caching file digests almost always slows things down.

I also tried timing with 20 threads, but that ended up being a bit slower.
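For context on why per-directory caching works at all: a recursive checksummer derives each directory's digest from the digests of its entries, so the digest of an unchanged subtree can be served straight from cache. The sketch below mirrors the shape of a recursive variant like those being benchmarked; it is NOT dandi's actual zarr-checksum algorithm, just an illustration of the structure that makes directory-level caching possible.

```python
# Hedged sketch of recursive per-directory checksumming of a Zarr tree.
# The exact digest format is illustrative, not dandi's zarr-checksum scheme.
import hashlib
import os


def digest_directory(dirpath):
    """Digest a directory from its entries' digests (bottom-up)."""
    h = hashlib.md5()
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isdir(path):
            # This call is the natural caching point: if the subtree is
            # unchanged, a cached value can be returned instead.
            entry_digest = digest_directory(path)
        else:
            with open(path, "rb") as f:
                entry_digest = hashlib.md5(f.read()).hexdigest()
        h.update(f"{name}\0{entry_digest}\n".encode())
    return h.hexdigest()
```

Because the top-level digest depends only on child digests, a fully warm cache reduces the work to stat-ing directories and combining cached values, which is consistent with the flat ~11-second cached reads in the table above.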

@yarikoptic (Member Author)

If you still have the instance available, could you please also provide the timing for that /shared/io-utils/fastio_md5.py (present on the hub)? It would give a reference timing that I think we should strive to achieve (without caching).

@jwodder (Member) commented Feb 18, 2022

@yarikoptic That's what the "fastio" column is (except it digests files piecemeal and also calculates a final Zarr checksum).

(But if you really want the exact time for that exact script on my sample tree on the droplet, it's 59.799s.)

@yarikoptic (Member Author)

great -- thank you!

@jwodder (Member) commented Feb 21, 2022

@yarikoptic Question: What's the plan for dealing with the fact that the upload code currently calculates the digest for a Zarr twice? The first time, the digest is used to determine whether to upload the Zarr asset at all (based on --existing) and to populate the digest field in the asset metadata; the second time, it's used to check that the digest recorded in the metadata is still the same and that the server reports the correct value after upload. I recall that this was discussed, but I don't recall a conclusion, and no issue was created. Implementing this issue and #914 would be simpler if we could drop the first digestion.

@yarikoptic (Member Author)

Well, #915 (it was not assigned) was my summary, which is in line (I think) with your thinking above -- that the first digesting should be dropped (for now, at least).

@yarikoptic (Member Author)

  • Note from con/fscacher#71 (comment) ("Cache directory fingerprint as a XORed hash of file fingerprints"): timings for cached reads improved from 11 sec to 0.3 sec with optimizations to fscacher.
  • fastio seems to consistently and notably (by up to 10%) outperform the OO version (oothreads), so I would prefer to go with fastio.
  • I am a little disappointed that Caching Directories (in any form) adds too notable an overhead (~20%) to just enable it by default. But maybe it is worth rerunning the benchmarks (at least for fastio) with the current fscacher, to see whether that overhead also shrank with the recent changes?
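The idea behind the con/fscacher#71 optimization referenced above — a directory fingerprint as a XOR of per-file fingerprints — can be sketched as follows. This is an illustration of the XOR-combining technique, not fscacher's actual implementation; the function names and the choice of stat fields are assumptions.

```python
# Illustrative sketch: combine fixed-size per-file fingerprints into a
# directory fingerprint by XOR, so the result is independent of traversal
# order and cheap to update incrementally. Not fscacher's actual code.
import hashlib


def file_fingerprint(path, mtime, size):
    # Hash cheap stat-based metadata rather than file contents.
    data = f"{path}\0{mtime}\0{size}".encode()
    return hashlib.md5(data).digest()


def dir_fingerprint(entries):
    """entries: iterable of (path, mtime, size) triples."""
    acc = bytes(16)  # 16 zero bytes, matching the MD5 digest size
    for path, mtime, size in entries:
        fp = file_fingerprint(path, mtime, size)
        acc = bytes(a ^ b for a, b in zip(acc, fp))
    return acc.hex()
```

Order-independence means threads can contribute fingerprints in whatever order they finish, with no sorting or locking beyond the final combine — plausibly part of why the cached-read time dropped so sharply.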

@jwodder (Member) commented Feb 22, 2022

@yarikoptic

But maybe it is worth rerunning the benchmarks (at least for fastio) with the current fscacher

I already ran those benchmarks; you can see them here, in the rows labelled "PR #71 (xor_bytes)", and the relevant comparisons are here.

@yarikoptic (Member Author)

The zarr64-smallfiles results are a bit confusing, but overall it seems to me that "PR #71, 5 threads" adds not that much overhead over "baseline, 5 threads" (and is somehow at times faster, e.g. the "oothreads" implementation in zarr64-smallfiles), unless I am reading those results wrong.
