Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

digest -d zarr-checksum seems to take some .dotdirs like .git or .dandi into account whenever it should skip those #1146

Closed
yarikoptic opened this issue Oct 27, 2022 · 8 comments · Fixed by #1147
Assignees

Comments

@yarikoptic
Copy link
Member

(dandisets) dandi@drogon:/mnt/backup/dandi/tmp/fc3f8c83-f4a9-48e4-a67c-aa24b530f82a$ fg
dandi digest -d zarr-checksum .
.: 0cbd5bcf66e160716528bbe02ee9ef15-133468--110916702211

whenever

(dandisets) dandi@drogon:/mnt/backup/dandi/tmp/fc3f8c83-f4a9-48e4-a67c-aa24b530f82a$ find . | grep -v  -e '\.dandi' -e '\.git' -e '\.datalad' | nl | tail -n 1
 32601  ./info

and without directories

(base) dandi@drogon:/mnt/backup/dandi/tmp/fc3f8c83-f4a9-48e4-a67c-aa24b530f82a$ find . \! -type d | grep -v  -e '\.dandi' -e '\.git' -e '\.datalad' | wc -l
32385

NB: local and archive stored checksum is completely off:

(dandisets) dandi@drogon:/mnt/backup/dandi/tmp/fc3f8c83-f4a9-48e4-a67c-aa24b530f82a$ cat .dandi/zarr-checksum 
edfb9e937142aceca60acef5fe5ba98a-4845--7049299697

related:

@jwodder
Copy link
Member

jwodder commented Oct 27, 2022

@yarikoptic What do you mean, "whenever it should skip those"? There's no rule (at the moment) that those should be ignored for Zarr checksum purposes.

Possible duplicate: #1086

@yarikoptic
Copy link
Member Author

It "should" as that we should skip those in zarr-digest estimation -- might indeed just require fix for #1086, but I think skipping .dandi and .datalad is specific to zarr checksumming , and if we decide to use threaded walk for something else we might or not want to skip them.

@jwodder
Copy link
Member

jwodder commented Oct 27, 2022

@yarikoptic

  • Exactly what files should be skipped? Note that backups2datalad currently regards .dandi/, .datalad/, .git/, .gitattributes, and .gitmodules as special and ignored for traversal purposes.
  • Should these files only be ignored when they occur at the root of the Zarr or anywhere in the Zarr?
  • Should the exclusion only apply to the dandi digest command or also to uploading Zarrs?

@yarikoptic
Copy link
Member Author

  • Exactly what files should be skipped? Note that backups2datalad currently regards .dandi/, .datalad/, .git/, .gitattributes, and .gitmodules as special and ignored for traversal purposes.

Let's skip all those.

  • Should these files only be ignored when they occur at the root of the Zarr or anywhere in the Zarr?

AFAIK they should be present only at the root. If we could easily add a check - if any of those happen anywhere not there then raise some RuntimeException -- would be the best so we do not miss some obscure use case. If that is tricky to do -- let's just skip them everywhere since I am not expecting them to be a part of any zarr folder anywhere under.

  • Should the exclusion only apply to the dandi digest command or also to uploading Zarrs?

also to uploading, thanks for checking on that.

@jwodder
Copy link
Member

jwodder commented Oct 27, 2022

@yarikoptic

  • If we're applying this exclusion to uploading, should we also exclude when downloading a Zarr (i.e., don't delete any local special files when synchronizing the file set) ?
  • I think trying to ignore these files regardless of what depth they appear at may make it difficult to enforce the "no more than 7 directories deep" check on Zarrs.

@yarikoptic
Copy link
Member Author

@yarikoptic

  • If we're applying this exclusion to uploading, should we also exclude when downloading a Zarr (i.e., don't delete any local special files when synchronizing the file set) ?

Hmm, I can't see any legit use case where we would have some local special files to coexist with new downloaded files from Zarr. Do you see the need/use case?

  • I think trying to ignore these files regardless of what depth they appear at may make it difficult to enforce the "no more than 7 directories deep" check on Zarrs.

Ok, let's just ignore on top only then for now

@jwodder
Copy link
Member

jwodder commented Oct 27, 2022

@yarikoptic

Do you see the need/use case?

dandi download can be used to sync a local Zarr with a copy on the server. If the local copy is also a Git repository/DataLad dataset, then downloading will currently delete .git and other special files from the Zarr. I don't know whether you consider that a likely use case, but if someone is already uploading a Zarr stored with special files (which seems unlikely to me), downloading into such a setup doesn't seem that unlikely.

@yarikoptic
Copy link
Member Author

Indeed it could potentially be used for such case. Not sure though I would worry about special-casing for it here in download. Please proceed the way you see it be best.

yarikoptic added a commit that referenced this issue Oct 31, 2022
Exclude special dotfiles from Zarrs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants