zarr upload: do not pre-digest anything #915

Closed
yarikoptic opened this issue Feb 16, 2022 · 9 comments · Fixed by #923

@yarikoptic (Member)

A distilled variant of #903 which, if we decide to proceed with it, could take precedence over improving any related code (#913 for more efficient digesting of zarr folders, and #914 for disabling fscaching of individual files in zarrs):

  • disable digesting of zarr folders in upload -- always assume that they differ
  • do not digest individual files unless really needed -- i.e. when a remote file is present with the same size, so that its ETag is the only way to tell for sure whether it is the same
  • for now, keep digesting in order to provide digests when minting upload URLs for the batch. But I am afraid we had better disable fscaching of those digests (thus, in part, do not fscache individual file digests for zarr-checksum, #914), since it seems to have too notable an impact
    • ideally there should be a separate process/thread that takes care of pre-digesting files which might be missing digests for the next batch; this way we could minimize the delay between batches
    • the md5 digest should be computed in-band while uploading and compared against the file's ETag once done (see the sketch after this list)
  • overall zarr archive checksumming on the server will be disabled, so we should then stop checking whether our upload of the zarr is consistent with the remote upon completion.
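
A minimal sketch of the in-band digesting idea, assuming a presigned S3 PUT URL and a single-part upload (so the returned ETag equals the MD5 of the body); HashingReader and upload_entry are illustrative names, not existing dandi-cli code:

    import hashlib
    import os

    import requests


    class HashingReader:
        """File-like wrapper that updates an MD5 digest as data is read from it."""

        def __init__(self, path):
            self.fp = open(path, "rb")
            self.size = os.path.getsize(path)
            self.md5 = hashlib.md5()

        def __len__(self):
            return self.size

        def read(self, size=-1):
            chunk = self.fp.read(size)
            self.md5.update(chunk)
            return chunk


    def upload_entry(presigned_url, path):
        # Hypothetical sketch, not dandi-cli's actual upload code.
        reader = HashingReader(path)
        try:
            r = requests.put(presigned_url, data=reader)
            r.raise_for_status()
        finally:
            reader.fp.close()
        # For a single-part upload, S3's ETag is the MD5 hex digest of the body.
        if r.headers["ETag"].strip('"') != reader.md5.hexdigest():
            raise RuntimeError(f"ETag mismatch for {path}")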

Is that about right, @satra? And are we disabling checksumming of zarrs on dandi-api during upload, @dchiquito?

@satra (Member) commented Feb 16, 2022

@dchiquito - the presigned url generation should make the md5 parameter optional. we are planning to upload and check on the client side and the client can push a tree-checksum after upload is complete that the server could verify.
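
For illustration only, one way a client could derive a tree checksum from per-file digests (a hypothetical scheme, not the zarr checksum algorithm the archive actually uses):

    import hashlib
    import json
    from pathlib import Path


    def file_md5(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()


    def tree_checksum(root):
        # Deterministically combine (path, md5, size) triples into one digest.
        entries = sorted(
            (str(p.relative_to(root)), file_md5(p), p.stat().st_size)
            for p in Path(root).rglob("*")
            if p.is_file()
        )
        return hashlib.md5(json.dumps(entries).encode()).hexdigest()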

but for the moment i would disable generation of checksum files.

@yarikoptic (Member, Author) commented Feb 16, 2022

and the client can push a tree-checksum after upload is complete that the server could verify.

hm, this sounds a bit backwards since we do not really have an "invalid" state on the server (and what would we do if we got into one?); that is why the client is "pull"ing the checksum, verifying it, and alerting the user. Seems this needs more thinking.

@satra (Member) commented Feb 17, 2022

we do have an invalid state -- checksum pending -- for normal assets, so when the server finishes calculating the checksum, which can potentially take some time, it can update the asset state. the client would have to wait, which in some cases could be minutes, and i don't think we want the client to do that during upload -- unless we know that the checksum lands within, say, 30s of the last batch uploaded. if that is the case, we could have the client wait to get the value back.

@yarikoptic (Member, Author)

we do have invalid state - checksum pending for normal assets, so when server finishes calculating checksum, which can take some time potentially, it can update the asset state

yes. In the description above, what matters to me is that the "state" is informed not by the client but by the server -- the checksum is computed on the server, and whatever is computed is what the server can rely on as valid (since it is the server that did the computing). It is in no way informed by whatever checksum the client might provide (which might be incorrect, not up to date, or whatnot), which could otherwise invalidate a checksum for the zarr that is "valid" as far as S3 is concerned. That is why IMHO it is for the client to verify that the server has what the client expects, and if not -- to mitigate (reupload fully or partially), but not to mess with the state of the asset as the server knows it directly. (And that is why we compute sha256 ourselves instead of just taking it from the client.)

but for the moment i would disable generation of checksum files.

is that generation/updating actually a bottleneck (I do not remember a clear demonstration that it is)? If so, then IMHO we should look into mitigating it ("aggregate" .checksums files? or keep them in the DB entirely) instead of completely disabling it, because ensuring consistency might otherwise require a more complicated implementation later (e.g. needing to lock the zarr to avoid any changes while the checksum is computed; dandi-cli would need to channel that error to users, etc.). But I would leave it to you and @dchiquito to decide what we do on that.

If/when checksum updates are disabled, we would need to adjust dandi-cli to not wait for/expect them, and also to use the ETag to decide whether a specific file needs to be uploaded, should the API stop providing checksums. That would also require an interaction with S3 for each tiny file, making it slow.
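
For illustration, a rough sketch of that per-file decision, assuming direct S3 access via boto3 and single-part uploads (needs_upload is a hypothetical helper, not dandi-cli code); the head_object call is exactly the extra per-file round trip mentioned above:

    import hashlib
    import os

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")


    def needs_upload(bucket, key, local_path):
        """Return True if the local file differs from (or is absent on) S3."""
        try:
            head = s3.head_object(Bucket=bucket, Key=key)
        except ClientError:
            return True  # no remote object yet -> upload
        if head["ContentLength"] != os.path.getsize(local_path):
            return True
        with open(local_path, "rb") as f:
            local_md5 = hashlib.md5(f.read()).hexdigest()
        # For single-part uploads the ETag is the MD5 of the object body.
        return head["ETag"].strip('"') != local_md5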

@jwodder (Member) commented Feb 21, 2022

@yarikoptic

overall zarr archive checksum'ing on the server will be disabled, so we should then stop checking if our upload of the zarr is consistent with the remote upon completion.

Is there a dandi-archive issue for this? Aside from the contemplated changes to the API that @satra is talking about, this seems to be the current blocker for this issue.

@satra (Member) commented Feb 21, 2022

@dchiquito - is the md5 parameter optional? if not, can it be made optional so that this strategy can be implemented?

@yarikoptic, @dchiquito, and @jwodder - regarding checksums

  • the zarr trees are going to be large, whether now or in the future.
  • we will need an algorithm to determine what the delete/add/update operations are going to be, given a local tree and a remote tree (including an empty or partially uploaded tree) -- see the sketch after this list
  • we need efficiency (not brute force). if we are to use checksum files for every directory/file, we need to ensure that traversal in the context of the algorithm is operationally acceptable, and so is any delete operation (which can be time-consuming).
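
As a rough sketch of such an algorithm (plan_sync is a hypothetical helper; remote_entries is assumed to map relative path to (size, etag), obtained however the archive or S3 exposes the listing):

    from pathlib import Path


    def plan_sync(local_root, remote_entries):
        # Sketch only: compute delete/add/update operations for a zarr tree.
        local = {
            str(p.relative_to(local_root)): p.stat().st_size
            for p in Path(local_root).rglob("*")
            if p.is_file()
        }
        to_delete = [path for path in remote_entries if path not in local]
        to_add = [path for path in local if path not in remote_entries]
        # Entries present on both sides are re-uploaded here only when sizes
        # differ; a fuller check would also compare local MD5 against the ETag.
        to_update = [
            path
            for path, size in local.items()
            if path in remote_entries and remote_entries[path][0] != size
        ]
        return to_delete, to_add, to_update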

@yarikoptic (Member, Author)

@satra so far we have no timings on different approaches to tell "efficient" from "brute force" apart, or to say what those would even mean. IMHO it would indeed be more efficient to operate on tree checksums, but ATM we are trying to get away from computing those due to a shortcoming which ruins upload efficiency. Or am I missing something?

@yarikoptic

overall zarr archive checksum'ing on the server will be disabled, so we should then stop checking if our upload of the zarr is consistent with the remote upon completion.

Is there a dandi-archive issue for this? Aside from the contemplative changes to the API that @satra's talking about, this seems to be the current blocker for this issue.

there is dandi/dandi-archive#912 now, but in the original issue description I was not even aiming for that right away (that is why the 3rd clause says "for now"), and IMHO it is not even really needed now, since it is not really the slow-downer for uploads: for everything besides the first batch we could pre-digest the next batch (in a separate thread) before we request URLs for it, thus eliminating any wait time for digesting individual files. That would retain that portion of the upload code. We just need to eliminate the "entire zarr folder" pre-digestion.
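
A minimal sketch of that pipelining, where upload_batch is a hypothetical callable standing in for the existing mint-URLs-and-upload step:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor


    def md5_digest(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()


    def upload_in_batches(batches, upload_batch, digest_threads=4):
        # Digest batch N+1 in the background while batch N is uploading.
        if not batches:
            return
        with ThreadPoolExecutor(max_workers=digest_threads) as pool:
            # Only the first batch has to be digested up front.
            pending = {p: pool.submit(md5_digest, p) for p in batches[0]}
            for i, batch in enumerate(batches):
                digests = {p: fut.result() for p, fut in pending.items()}
                if i + 1 < len(batches):
                    # Kick off digestion of the next batch before uploading this one.
                    pending = {p: pool.submit(md5_digest, p) for p in batches[i + 1]}
                upload_batch(batch, digests)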

@satra (Member) commented Feb 21, 2022

insofar we have no timings on different approaches to tell "efficient" from "brute force" apart or what those would mean.

we do have timings of many different kinds.

  1. with the fastio threaded approach we have timings for digesting. i think john has already shown this to be much faster.
  2. upload timing based on a wrapper around s5cmd. that's the benchmark to beat. but to truly compare, we would need to disable the md5 digest. s5cmd doesn't do digesting; it simply checks timestamp + filesize. this is also much faster than anything with the CLI or even just vanilla api scripts (>2x, as shown by jake's experiments).
  3. and we have timings for fetching the digests from s3 with s5cmd (this is also very fast compared to doing the same thing with the aws cli).

i have run 1 -> 2 -> 3 on an example ngff directory. the overall timing with s5cmd is significantly faster than my previous attempt to upload with the dandi cli. caveat: because of impending needs, i have only tested upload of a fresh batch, not syncing of modified batches.

my current attention is on trying to get this data rechunked and uploaded as fast as possible, and every aspect of that process needs hypertuning. hence, i was hoping that someone else would do the comparison given all the scripts (on the hub) and data (in staging and hub).

@jwodder (Member) commented Feb 22, 2022

@yarikoptic So the things to do for this issue, in no particular order, are:

  • Disable the initial checksumming of Zarr assets in the upload() function, always re-upload Zarrs as though --existing were set to force, and don't verify any digests in the Zarr metadata passed to ZarrAsset.iter_upload()
    • This will require the server to fill in the Zarr's checksum in the "digest" field in the asset metadata; does it do that?
  • Don't digest individual Zarr entries until the digest is actually needed, either for comparing with a remote entry at the same path of the same size or for use in the /zarr/{zarr_id}/upload/ payload
  • Do as much entry digestion in threads as possible
    • How many threads? Should this be controlled by the same setting that determines the number of upload threads per asset? Note that, until "Optional digests" is implemented, we'd have to be running digestion threads in parallel with upload threads. (Should those run in separate thread pools or the same thread pool?)
    • Until "Optional digests" is implemented, there are two reasons why we would digest an entry: to compare it to a remote entry with the same path & size in order to determine whether to upload the entry, and to provide the digest in the upload payload. Digestions of the first type need to be completed before starting any actual uploads if we want the upload percentages yielded by ZarrAsset.iter_upload() to correctly reflect the total amount of data scheduled for upload; is that OK?
  • Don't use fscacher when digesting local entries
  • Don't compute the complete Zarr checksum, and don't verify the server's checksum after upload
    • Is this blocked by anything on the dandi-archive side?
  • ("Optional digests") Don't digest entries for the upload payload (Blocked by ​/zarr​/{zarr_id}​/upload​/: make etag (md5 ) optional dandi-archive#912)
  • Digest entries while uploading and compare the result to the returned ETag
    • Should this only be done as part of "Optional digests"?

Is that complete/correct?
