zarr upload: do not pre-digest anything #915

Closed
yarikoptic opened this issue Feb 16, 2022 · 9 comments · Fixed by #923

@yarikoptic (Member)

A distilled variant of #903 which, if we decide to proceed with it, could take precedence over improving any related code (#913 for more efficient digesting of zarr folders, and #914 for disabling fscaching of individual files in zarrs):

  • disable digesting of zarr folders in upload -- always assume that they differ
  • do not digest individual files unless really needed -- i.e. when a remote file is present with the same size, so that its ETag is the only way to tell for sure whether it is the same
  • for now, keep digesting in order to provide digests when minting upload URLs for the batch. But I am afraid we had better disable fscaching of those digests (thus, in part, do not fscache individual file digests for zarr-checksum, #914), since it seems to have too notable an impact
    • ideally there should be a separate process/thread that takes care of pre-digesting files which might be missing digests for the next batch; this way we could minimize the delay between batches
    • the md5 digest should be computed in-band while uploading and compared against the file's ETag once done (see the sketch after this list)
  • overall zarr archive checksumming on the server will be disabled, so we should then stop checking whether our upload of the zarr is consistent with the remote upon completion.
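
A minimal sketch of the in-band digesting idea, assuming a presigned S3 PUT URL and a single-part upload (so the returned ETag equals the MD5 of the body); HashingReader and upload_entry are illustrative names, not existing dandi-cli code:

    import hashlib
    import os

    import requests


    class HashingReader:
        """File-like wrapper that updates an MD5 digest as data is read from it."""

        def __init__(self, path):
            self.fp = open(path, "rb")
            self.size = os.path.getsize(path)
            self.md5 = hashlib.md5()

        def __len__(self):
            return self.size

        def read(self, size=-1):
            chunk = self.fp.read(size)
            self.md5.update(chunk)
            return chunk


    def upload_entry(presigned_url, path):
        # Hypothetical sketch, not dandi-cli's actual upload code.
        reader = HashingReader(path)
        try:
            r = requests.put(presigned_url, data=reader)
            r.raise_for_status()
        finally:
            reader.fp.close()
        # For a single-part upload, S3's ETag is the MD5 hex digest of the body.
        if r.headers["ETag"].strip('"') != reader.md5.hexdigest():
            raise RuntimeError(f"ETag mismatch for {path}")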

Is that about right, @satra? And are we disabling checksumming of zarrs on dandi-api during upload, @dchiquito?

@satra (Member) commented Feb 16, 2022

@dchiquito - the presigned url generation should make the md5 parameter optional. we are planning to upload and check on the client side and the client can push a tree-checksum after upload is complete that the server could verify.
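
For illustration only, one way a client could derive a tree checksum from per-file digests (a hypothetical scheme, not the zarr checksum algorithm the archive actually uses):

    import hashlib
    import json
    from pathlib import Path


    def file_md5(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()


    def tree_checksum(root):
        # Deterministically combine (path, md5, size) triples into one digest.
        entries = sorted(
            (str(p.relative_to(root)), file_md5(p), p.stat().st_size)
            for p in Path(root).rglob("*")
            if p.is_file()
        )
        return hashlib.md5(json.dumps(entries).encode()).hexdigest()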

but for the moment i would disable generation of checksum files.

@yarikoptic (Member, Author) commented Feb 16, 2022

and the client can push a tree-checksum after upload is complete that the server could verify.

hm, this sounds a bit backwards since we do not really have an "invalid" state on the server (and what would we do if we got into one?); that is why the client is "pull"ing the checksum, verifying it, and alerting the user. Seems this needs more thinking.

@satra (Member) commented Feb 17, 2022

we do have an invalid state -- checksum pending -- for normal assets, so when the server finishes calculating the checksum, which can potentially take some time, it can update the asset state. the client would have to wait, which in some cases could be minutes, and i don't think we want the client to do that during upload -- unless we know that the checksum lands within, say, 30s of the last batch uploaded. if that is the case, we could have the client wait to get the value back.

@yarikoptic (Member, Author)

we do have invalid state - checksum pending for normal assets, so when server finishes calculating checksum, which can take some time potentially, it can update the asset state

yes. In the description above, what matters to me is that the "state" is informed not by the client but by the server -- the checksum is computed on the server, and whatever is computed is what the server can rely on as valid (since it is the server that did the computing). It is in no way informed by whatever checksum the client might provide (which might be incorrect, not up to date, or whatnot), which could otherwise invalidate a checksum for the zarr that is "valid" as far as S3 is concerned. That is why IMHO it is for the client to verify that the server has what the client expects, and if not -- to mitigate (reupload fully or partially), but not to mess with the state of the asset as the server knows it directly. (And that is why we compute sha256 ourselves instead of just taking it from the client.)

but for the moment i would disable generation of checksum files.

is that generation/updating actually a bottleneck (I do not remember a clear demonstration that it is)? If so, then IMHO we should look into mitigating it ("aggregate" .checksums files? or keep them in the DB entirely) instead of completely disabling it, because ensuring consistency might otherwise require a more complicated implementation later (e.g. needing to lock the zarr to avoid any changes while the checksum is computed; dandi-cli would need to channel that error to users, etc.). But I would leave it to you and @dchiquito to decide what we do on that.

If/when checksum updates are disabled, we would need to adjust dandi-cli to not wait for/expect them, and also to use the ETag to decide whether a specific file needs to be uploaded, should the API stop providing checksums. That would also require an interaction with S3 for each tiny file, making it slow.
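
For illustration, a rough sketch of that per-file decision, assuming direct S3 access via boto3 and single-part uploads (needs_upload is a hypothetical helper, not dandi-cli code); the head_object call is exactly the extra per-file round trip mentioned above:

    import hashlib
    import os

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")


    def needs_upload(bucket, key, local_path):
        """Return True if the local file differs from (or is absent on) S3."""
        try:
            head = s3.head_object(Bucket=bucket, Key=key)
        except ClientError:
            return True  # no remote object yet -> upload
        if head["ContentLength"] != os.path.getsize(local_path):
            return True
        with open(local_path, "rb") as f:
            local_md5 = hashlib.md5(f.read()).hexdigest()
        # For single-part uploads the ETag is the MD5 of the object body.
        return head["ETag"].strip('"') != local_md5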

@jwodder (Member) commented Feb 21, 2022

@yarikoptic

overall zarr archive checksum'ing on the server will be disabled, so we should then stop checking if our upload of the zarr is consistent with the remote upon completion.

Is there a dandi-archive issue for this? Aside from the contemplated changes to the API that @satra is talking about, this seems to be the current blocker for this issue.

@satra (Member) commented Feb 21, 2022

@dchiquito - is the md5 parameter optional? if not, can it be made optional so that this strategy can be implemented?

@yarikoptic, @dchiquito, and @jwodder - regarding checksums

  • the zarr trees are going to be large, whether now or in the future.
  • we will need an algorithm to determine what the delete/add/update operations are going to be, given a local tree and a remote tree (including an empty or partially uploaded tree) -- see the sketch after this list
  • we need efficiency (not brute force). if we are to use checksum files for every directory/file, we need to ensure that traversal in the context of the algorithm is operationally acceptable, and so is any delete operation (which can be time-consuming).
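
As a rough sketch of such an algorithm (plan_sync is a hypothetical helper; remote_entries is assumed to map relative path to (size, etag), obtained however the archive or S3 exposes the listing):

    from pathlib import Path


    def plan_sync(local_root, remote_entries):
        # Sketch only: compute delete/add/update operations for a zarr tree.
        local = {
            str(p.relative_to(local_root)): p.stat().st_size
            for p in Path(local_root).rglob("*")
            if p.is_file()
        }
        to_delete = [path for path in remote_entries if path not in local]
        to_add = [path for path in local if path not in remote_entries]
        # Entries present on both sides are re-uploaded here only when sizes
        # differ; a fuller check would also compare local MD5 against the ETag.
        to_update = [
            path
            for path, size in local.items()
            if path in remote_entries and remote_entries[path][0] != size
        ]
        return to_delete, to_add, to_update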

@yarikoptic (Member, Author)

@satra so far we have no timings on different approaches to tell "efficient" from "brute force" apart, or to say what those would even mean. IMHO it would indeed be more efficient to operate on tree checksums, but ATM we are trying to get away from computing those due to a shortcoming which ruins upload efficiency. Or am I missing something?

@yarikoptic

overall zarr archive checksum'ing on the server will be disabled, so we should then stop checking if our upload of the zarr is consistent with the remote upon completion.

Is there a dandi-archive issue for this? Aside from the contemplative changes to the API that @satra's talking about, this seems to be the current blocker for this issue.

there is dandi/dandi-archive#912 now, but in the original issue description I was not even aiming for that right away (that is why the 3rd clause says "for now"), and IMHO it is not even really needed now, since it is not really the slow-downer for uploads: for everything besides the first batch we could pre-digest the next batch (in a separate thread) before we request URLs for it, thus eliminating any wait time for digesting individual files. That would retain that portion of the upload code. We just need to eliminate the "entire zarr folder" pre-digestion.
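
A minimal sketch of that pipelining, where upload_batch is a hypothetical callable standing in for the existing mint-URLs-and-upload step:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor


    def md5_digest(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()


    def upload_in_batches(batches, upload_batch, digest_threads=4):
        # Digest batch N+1 in the background while batch N is uploading.
        if not batches:
            return
        with ThreadPoolExecutor(max_workers=digest_threads) as pool:
            # Only the first batch has to be digested up front.
            pending = {p: pool.submit(md5_digest, p) for p in batches[0]}
            for i, batch in enumerate(batches):
                digests = {p: fut.result() for p, fut in pending.items()}
                if i + 1 < len(batches):
                    # Kick off digestion of the next batch before uploading this one.
                    pending = {p: pool.submit(md5_digest, p) for p in batches[i + 1]}
                upload_batch(batch, digests)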

@satra (Member) commented Feb 21, 2022

insofar we have no timings on different approaches to tell "efficient" from "brute force" apart or what those would mean.

we do have timings of many different kinds.

  1. with the fastio threaded approach we have timings for digesting. i think john has already shown this to be much faster.
  2. upload timing based on a wrapper around s5cmd. that's the benchmark to beat. but to truly compare, we would need to disable the md5 digest. s5cmd doesn't do digesting; it simply checks timestamp + filesize. this is also much faster than anything with the CLI or even just vanilla api scripts (>2x, as shown by jake's experiments).
  3. and we have timings for fetching the digests from s3 with s5cmd (this is also very fast compared to doing the same thing with the aws cli).

i have run 1 -> 2 -> 3 on an example ngff directory. the overall timing with s5cmd is significantly faster than my previous attempt to upload with the dandi cli. caveat: because of impending needs, i have only tested upload of a fresh batch, not syncing of modified batches.

my current attention is on trying to get this data rechunked and uploaded as fast as possible, and every aspect of that process needs hypertuning. hence, i was hoping that someone else would do the comparison given all the scripts (on the hub) and data (in staging and hub).

@jwodder (Member) commented Feb 22, 2022

@yarikoptic So the things to do for this issue, in no particular order, are:

  • Disable the initial checksumming of Zarr assets in the upload() function, always re-upload Zarrs as though --existing were set to force, and don't verify any digests in the Zarr metadata passed to ZarrAsset.iter_upload()
    • This will require the server to fill in the Zarr's checksum in the "digest" field in the asset metadata; does it do that?
  • Don't digest individual Zarr entries until the digest is actually needed, either for comparing with a remote entry at the same path of the same size or for use in the /zarr/{zarr_id}/upload/ payload
  • Do as much entry digestion in threads as possible
    • How many threads? Should this be controlled by the same setting that determines the number of upload threads per asset? Note that, until "Optional digests" is implemented, we'd have to be running digestion threads in parallel with upload threads. (Should those run in separate thread pools or the same thread pool?)
    • Until "Optional digests" is implemented, there are two reasons why we would digest an entry: to compare it to a remote entry with the same path & size in order to determine whether to upload the entry, and to provide the digest in the upload payload. Digestions of the first type need to be completed before starting any actual uploads if we want the upload percentages yielded by ZarrAsset.iter_upload() to correctly reflect the total amount of data scheduled for upload; is that OK?
  • Don't use fscacher when digesting local entries
  • Don't compute the complete Zarr checksum, and don't verify the server's checksum after upload
    • Is this blocked by anything on the dandi-archive side?
  • ("Optional digests") Don't digest entries for the upload payload (Blocked by ​/zarr​/{zarr_id}​/upload​/: make etag (md5 ) optional dandi-archive#912)
  • Digest entries while uploading and compare the result to the returned ETag
    • Should this only be done as part of "Optional digests"?

Is that complete/correct?
