zarr upload: do not pre-digest anything #915
@dchiquito - the presigned url generation should make the md5 parameter optional. we are planning to upload and check on the client side, and the client can push a tree-checksum after upload is complete that the server could verify. but for the moment i would disable generation of checksum files.
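A minimal sketch of that flow, assuming a hypothetical API where the md5/etag field is optional when requesting presigned upload URLs and a hypothetical finalize endpoint that accepts a client-computed tree checksum; the paths and field names below are illustrative only, not the actual dandi-archive API:

```python
# Sketch only: endpoint paths, field names, and the finalize step are
# assumptions for illustration, not the real dandi-archive zarr API.
import requests

API = "https://api.dandiarchive.org/api"  # base URL used for illustration

def upload_zarr_batch(zarr_id: str, files: list[dict], token: str) -> None:
    headers = {"Authorization": f"token {token}"}
    # 1. Request presigned URLs *without* supplying per-file md5 digests.
    batch = requests.post(
        f"{API}/zarr/{zarr_id}/upload/",            # hypothetical path
        json=[{"path": f["path"]} for f in files],  # no md5/etag field
        headers=headers,
    )
    batch.raise_for_status()
    # 2. Upload each file straight to S3 via its presigned URL.
    for entry, f in zip(batch.json(), files):
        with open(f["local_path"], "rb") as fp:
            requests.put(entry["upload_url"], data=fp).raise_for_status()

def push_tree_checksum(zarr_id: str, tree_checksum: str, token: str) -> None:
    # 3. After all batches are done, hand the client-computed tree checksum
    #    to the server so it can verify at its own pace (hypothetical endpoint).
    requests.post(
        f"{API}/zarr/{zarr_id}/finalize/",           # hypothetical path
        json={"checksum": tree_checksum},
        headers={"Authorization": f"token {token}"},
    ).raise_for_status()
```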
hm, this sounds a bit backwards since we do not really have an "invalid" state on the server (and what would we do if we got into one?); that is why the client is "pull"ing the checksum, verifying, and alerting the user. seems it needs more thinking
we do have an invalid state - checksum pending for normal assets - so when the server finishes calculating the checksum, which can potentially take some time, it can update the asset state. the client would have to wait, which in some cases could be minutes, and i don't think we want the client to do that during upload. unless we know that the checksum arrives within, say, 30s of the last batch uploaded; if that is the case, we could have the client wait to get the value back.
yes. In the description above, what is important to me is that the "state" is not informed by the client but by the server -- the checksum is computed (on the server), and whatever is computed is what the server can rely on as valid (since it is the server who did it). It should in no way be informed by the checksum the client might provide (which might be incorrect, not up to date, or whatnot), which would invalidate a "valid" (as on S3) checksum for the zarr. That is why IMHO it is for the client to verify that the server has what the client expects, and if not -- mitigate (re-upload fully or partially), but not mess with the state of the asset as the server knows it directly. (and that is why we compute sha256 ourselves instead of just taking it from the client)
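A rough sketch of that "client verifies and mitigates" loop; the checksum-fetching and re-upload steps are passed in as placeholder callables rather than real dandi-cli functions:

```python
# Placeholder callables stand in for the real API calls / dandi-cli steps.
def verify_zarr(zarr_id: str, local_checksum: str, get_remote_checksum, reupload) -> None:
    remote = get_remote_checksum(zarr_id)  # e.g. pull the zarr checksum from the API
    if remote is None:
        # server has not finished computing yet; the caller decides how long to wait
        raise RuntimeError("server checksum still pending")
    if remote != local_checksum:
        # mismatch: the mitigation happens on the client side (full or partial
        # re-upload); the server-known state is never overwritten by the client
        reupload(zarr_id)
```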
is that generation/updating a bottleneck (I do not remember a clear demonstration that it is)? If so, then IMHO we should look into mitigating it ("aggregate" .checksums files? or keep them in the DB entirely) instead of completely disabling it, because ensuring consistency might only require a more complicated implementation later (e.g. needing to lock the zarr to avoid any changes while the checksum is computed; dandi-cli would need to channel that error to users, etc.). But I would leave it to you and @dchiquito to decide on what we do about that. If/when checksum updates are disabled, we would need to adjust dandi-cli to not wait for/expect them, and also to use the ETag to decide whether a specific file needs to be uploaded or not if the API stops providing checksums. That would also require an interaction with S3 per each tiny file, making it slow.
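For illustration, a sketch of what that per-file ETag comparison could look like on the client: for objects uploaded in a single PUT, the S3 ETag is the hex MD5 of the content, so a HEAD request per key is enough to decide whether a file can be skipped, at the cost of one round trip per (possibly tiny) file:

```python
import hashlib
import requests

def needs_upload(local_path: str, object_url: str) -> bool:
    """Return True if the local file appears to differ from the S3 object."""
    md5 = hashlib.md5()
    with open(local_path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            md5.update(chunk)
    resp = requests.head(object_url)
    if resp.status_code == 404:
        return True  # not on S3 yet
    etag = resp.headers.get("ETag", "").strip('"')
    # multipart-uploaded objects have a composite ETag and will never match,
    # so this naive check would always re-upload them
    return etag != md5.hexdigest()
```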
Is there a dandi-archive issue for this? Aside from the contemplated changes to the API that @satra's talking about, this seems to be the current blocker for this issue.
@dchiquito - is the md5 parameter optional? if not, can it be made optional so that this strategy can be implemented? @yarikoptic, @dchiquito, and @jwodder - regarding checksums
@satra so far we have no timings of the different approaches to tell "efficient" from "brute force" apart, or what those would even mean. IMHO it would indeed be more efficient to operate on tree checksums, but ATM we are trying to get away from computing those due to a shortcoming which ruins upload efficiency. Or am I missing something?
there is dandi/dandi-archive#912 now, but in the original issue description I was not even aiming for that right away (that is why the "for now" 3rd clause), and IMHO I now think it is not even really needed, since it is not really the "slow downer" for uploads: apart from the first batch, we could pre-digest the next batch (in a separate thread) before we request URLs for it, thus eliminating any wait time on digesting individual files. That would retain that portion of the upload code. We just need to eliminate the "entire zarr folder" pre-digestion.
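A sketch of that pipelining idea: while batch N uploads, a worker thread already digests batch N+1, so only the very first batch pays the digestion wait. `digest_batch` and `upload_batch` here are placeholders for the corresponding dandi-cli steps, not existing functions:

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def digest_batch(paths: list[str]) -> dict[str, str]:
    """Compute md5 digests for one batch of files."""
    out = {}
    for p in paths:
        h = hashlib.md5()
        with open(p, "rb") as fp:
            for chunk in iter(lambda: fp.read(1 << 20), b""):
                h.update(chunk)
        out[p] = h.hexdigest()
    return out

def upload_pipelined(batches: list[list[str]], upload_batch) -> None:
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(digest_batch, batches[0])
        for i, batch in enumerate(batches):
            digests = future.result()  # ready (or nearly so) by the time we need it
            if i + 1 < len(batches):
                # start digesting the next batch while this one uploads
                future = pool.submit(digest_batch, batches[i + 1])
            upload_batch(batch, digests)  # request presigned URLs + PUT to S3
```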
we do have timings of many different kinds.
i have run 1 -> 2 -> 3 on an example ngff directory. the overall timing with s5cmd is significantly faster than my previous attempt to upload with dandi-cli. caveat: because of impending needs, i have only tested upload of a fresh batch, not syncing of modified batches. my current attention is on trying to get this data rechunked and uploaded as fast as possible, and every aspect of that process needs hypertuning. hence, i was hoping that someone else would do the comparison, given all the scripts (on the hub) and data (in staging and on the hub).
@yarikoptic So the things to do for this issue, in no particular order, are:
Is that complete/correct?
Minimize/optimize Zarr digestion when uploading
A distilled variant of #903 which, if we decide to proceed with it, could take precedence over improving any related code (#913 for more efficient digesting of zarr folders; and #914 for disabling fscaching of individual files in a zarr):
upload
-- always assume that they differ

Is that about right @satra? are we disabling checksumming of zarrs on dandi-api during upload @dchiquito?