multithread digest'ing of zarr folders #913
isn't 20x less a benefit :)
Ha ha, we don't see that 20x yet in fscacher - we might be starting too many threads or just have it too Pythonic ;-) yet to profile, etc.
I've created a script at https://github.com/jwodder/zarr-digest-timings for timing different Zarr checksumming methods with different caching configurations. In my initial tests (using large directory trees of small files), just using fscacher (both threaded & non-threaded) has had the worst effect on performance. Please try out the script on any decent sample Zarrs you have.
@yarikoptic @satra Following up on the above: My attempts to do the benchmarking on the Hub were thwarted because my session kept disconnecting, so I spun up a $10/month DigitalOcean droplet and used that. From timing on a directory tree of random data with a layout matching/based on
(Where caching was involved, two times are shown; the first is the runtime of the initial cache-populating call, and the second is the average runtime of subsequent calls.) Observations:
I also tried timing with 20 threads, but that ended up being a bit slower.
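The two-times convention above (runtime of the first, cache-populating call versus the average of subsequent calls) can be sketched with a minimal harness. This is a generic illustration, not the jwodder/zarr-digest-timings script itself; `time_call` is a hypothetical helper:

```python
import time

def time_call(func, *args, repeats=3):
    """Run func(*args) `repeats` times.

    Returns (first_run_seconds, mean_of_remaining_runs_seconds), mirroring
    the "cold first call, then averaged warm calls" reporting style.
    """
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    rest = times[1:] or times  # fall back to the single run if repeats == 1
    return times[0], sum(rest) / len(rest)
```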
If you still have the instance available, could you please also provide timing for that?
@yarikoptic That's what the "fastio" column is (except it digests files piecemeal and also calculates a final Zarr checksum). (But if you really want the exact time for that exact script on my sample tree on the droplet, it's 59.799s.)
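For illustration, per-file digests can be folded into a single tree-level checksum along these lines. This is NOT dandi's actual Zarr checksum format, just a hedged sketch with a hypothetical `combine_digests` helper showing one order-independent way to do it:

```python
import hashlib

def combine_digests(digests):
    """Fold {relative_path: hexdigest} into one checksum.

    Sorting by path makes the result independent of the order in which
    files were digested (e.g. by concurrent worker threads).
    """
    h = hashlib.md5()
    for path in sorted(digests):
        h.update(f"{path}:{digests[path]}\n".encode())
    return h.hexdigest()
```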
great -- thank you! |
@yarikoptic Question: What's the plan for dealing with the fact that the upload code currently calculates the digest for a Zarr twice? The first time, the digest is used to determine whether to upload the Zarr asset at all (based on |
Well, #915 (was not assigned) was my summary, which is in line (I think) with your thinking above -- that the first digesting should be dropped (for now at least).
|
|
ATM, if I run `dandi digest` on a hot folder (was done before, so IO is fast), `dandi digest` gets just 30% CPU busy and takes > 20 sec, whereas a parallelized example for checksumming takes about 20x less time and goes above 100% CPU.

Related PR introducing a multithreaded walk in fscacher (benefit is not yet 100% clear): con/fscacher#67
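A minimal sketch of the kind of parallelized per-file checksumming described above, using a thread pool so several files are hashed at once. The helpers `md5_file` and `digest_tree` are hypothetical, not dandi's actual implementation; threads can exceed 100% CPU here because `hashlib` releases the GIL while hashing large buffers:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def md5_file(path, chunk_size=1 << 20):
    """Digest one file in fixed-size chunks so memory stays bounded."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()

def digest_tree(root, workers=10):
    """Return {relative_path: md5_hexdigest} for every file under root."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {
            os.path.relpath(p, root): d
            for p, d in pool.map(md5_file, paths)
        }
```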