
Check digests when redownloading #1364

Closed

Conversation


@sneakers-the-rat sneakers-the-rat commented Nov 22, 2023

Hello again (again!)

Hoping this unsolicited PR isn't unwelcome - figured it would be lightweight enough that it didn't need to be pre-checked.

This PR checks file digests when deciding to re-download a file or not. Basically it implements this TODO:

```python
# TODO: use digests if available? or if e.g. size is identical
# but mtime is different
```

In downloading datasets I had one disk go down, and IT keeps messing with my network, so I get intermittent connection breaks that cause downloads to stall - I am pretty worried about corrupted data through all that. It makes sense to me to check file hashes at the time of deciding to redownload (so, semantically, attempting to download a file to the same location always results in the same file for a given dandiset and version). Thinking about the distributed future of dandi, we'll need to be doing a whole lot of hashing - e.g. this is a step towards being able to validate data from a mirror server against the hashes hosted on the 'main' server.

Pretty straightforward:

  • made a helper function check_digests - I noticed that in the relevant places there were several possible digests that could be checked against, so to tidy up a few if/elses I put that into one function that checks hashes in order of priority (see the sketch after this list). I do this rather than ensuring that all hashes match, or just testing the first one, because not all hashing functions are created equal (e.g. md5 has been broken for >10 years and should be deprecated), so we want to explicitly prefer some, especially since more will break in the future.
  • the OVERWRITE_DIFFERENT leg is unchanged - just refactored using check_digests
  • in REFRESH ...
    • If digests is None or an empty dict, behavior is unchanged.
    • If the mtime is missing and we have a digest, check it; don't redownload if the digest matches.
    • If we have mtime, first check it and size. If either doesn't match, redownload without checking the digest.
    • If mtime and size match and we have a digest, check it; if it matches, don't redownload.
  • Also added more log messages so we have greppable logs on all legs.
  • Tests for the check_digests function, and to ensure that files that match size and mtime but don't match hash trigger a re-download.
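A minimal sketch of the helper and the REFRESH decision described above - the priority ordering, signatures, and the `should_redownload` name are illustrative assumptions, not the PR's verbatim code; `get_digest` is the existing dandi-cli helper:

```python
from pathlib import Path

from dandi.support.digests import get_digest  # existing dandi-cli helper

# Hypothetical preferred-first ordering: stronger digests before legacy ones
DIGEST_PRIORITY = ("dandi-etag", "sha512", "sha256", "sha1", "md5")


def check_digests(path: Path, digests: dict, priority=DIGEST_PRIORITY) -> bool:
    """Check only the highest-priority digest type present in ``digests``."""
    for algo in priority:
        if algo in digests:
            return get_digest(path, digest=algo) == digests[algo]
    raise ValueError(f"No known digest type among {list(digests)}")


def should_redownload(path: Path, mtime, size, digests) -> bool:
    """REFRESH leg: cheap size/mtime checks first, digest as tie-breaker."""
    st = path.stat()
    if mtime is None:
        # no recorded mtime: fall back to the digest if one is available
        return not (digests and check_digests(path, digests))
    if st.st_mtime != mtime or st.st_size != size:
        return True  # cheap mismatch - redownload without hashing
    # size and mtime match; the digest is the authoritative check
    return bool(digests) and not check_digests(path, digests)
```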

Tests pass, and I ran the pre-commit actions. Not sure why mypy is failing on the ZarrChecksumTree import - I didn't touch that code. I see the code coverage checker complaining; I didn't see tests for the _download_file function itself among the other tests, which I'd need in order to test against a dandiset without a digest, but I can definitely add that if needed.

Potential changes

  • Check order: I had written it to quickly decide to redownload if mtime or size were not matching, and only check file digests if they did. That seems like a reasonable order to me, but it could be changed!
  • Refactoring/tidying: the check at the top of _download_file that decides whether to download is now quite long, and it seems like a separate enough operation that it should be split out into its own function, but I didn't want to do too much in one PR without checking.
  • Optionality: should this be optional? I think in general, when one is trying to redownload a file to the same directory with one of the relevant options (refresh or overwrite-different), one always wants to be sure that the file they have matches the one on the server. The digest might take a while, but it's the true check that the file is the same. If we want this to be optional, I can propagate a parameter up through the calling functions and the click command (a rough sketch follows this list).
  • Put it in the digest command? The current dandi digest command just computes the hash, but we could also have it check against the server hashes as well.
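If we did make it optional, the plumbing might look roughly like this - the option name and wiring below are hypothetical, though dandi-cli's CLI is click-based:

```python
import click


@click.command()
@click.option(
    "--check-digests/--no-check-digests",  # hypothetical option name
    default=True,
    help="Verify file digests when deciding whether to redownload.",
)
def download(check_digests: bool) -> None:
    # the flag would be threaded down to the decision logic,
    # e.g. _download_file(..., check_digests=check_digests)
    ...
```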

probably more! lmk what needs to be changed :).


Side note: I followed the DEVELOPMENT.md guide to run local tests, but the vendored docker-compose files didn't work, and the fixtures aren't set up to use the dandi-archive docker container I had set up outside of the dandi-cli repo. I had to add an extra DANDI_DOCKER_DIR env variable to replace LOCAL_DOCKER_DIR, which is hardcoded to a directory beneath dandi/tests/fixtures. Pretty sure I followed the instructions, and it looked like some parts of that doc have gotten out of sync with the fixtures. I also had to override the versioneer auto-versioning, which caused ~half of the tests to fail since my version was something like dandi-0+untagged.3399.g20efd37 (tags don't get fetched for forks by default).

- redownload immediately if size or mtime don't match, check digest only if they do
- debug messages on all arms of the decision to download
- move import into download function
- test using debug messages

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (114da47) 88.69% compared to head (b6c2195) 88.65%.
Report is 1 commit behind head on master.

| Files | Patch % | Lines |
| --- | --- | --- |
| dandi/download.py | 50.00% | 13 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #1364      +/-   ##
==========================================
- Coverage   88.69%   88.65%   -0.04%     
==========================================
  Files          77       77              
  Lines       10434    10503      +69     
==========================================
+ Hits         9254     9311      +57     
- Misses       1180     1192      +12     
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 88.65% <83.75%> (-0.04%) ⬇️ |


@sneakers-the-rat (Contributor, Author):

Interesting - looks like the Windows tests are failing: they say that a file of all zeros has the same hash as the real file. I'm assuming that's because the PersistentCache doesn't recognize that the file has changed, so it's returning the memoized digests. I could make that test pass by taking the digest before setting the mtime, but it seems like that's a correct failure catching a bug in the get_digest function that we shouldn't ignore - that would defeat checksumming in practice.

That actually seems to come from the way fscacher uses ctime_ns, which is deprecated on Windows: https://docs.python.org/3/library/os.html#os.stat_result.st_ctime_ns

https://github.com/con/fscacher/blob/e3a6dee365f89a3b0212d19976e61a8165bf0488/src/fscacher/cache.py#L209C10-L209C10
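An illustrative sketch of the suspected failure mode (this is the shape of the problem, not fscacher's actual code): if the cache fingerprint is built from stat fields including `st_ctime_ns`, then on Windows - where ctime is *creation* time rather than inode-change time - rewriting a file in place and restoring its mtime can produce an identical fingerprint, so the memoized digest gets served for new bytes.

```python
import os


def fingerprint(path: str) -> tuple:
    # Roughly the shape of a stat-based cache key (assumed fields):
    s = os.stat(path)
    return (s.st_mtime_ns, s.st_ctime_ns, s.st_size)


# On Windows, st_ctime_ns reports creation time, so overwriting the file's
# contents and then resetting mtime with os.utime() can leave all three
# fields unchanged - the cache then returns the digest of the old contents.
```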

@jwodder added the "minor" label (Increment the minor version when merged) Nov 22, 2023
@jwodder changed the title from "[Feature][minor] - Check Digests when Redownloading" to "Check Digests when Redownloading" Nov 22, 2023
@jwodder changed the title from "Check Digests when Redownloading" to "Check digests when redownloading" Nov 22, 2023
(Two resolved review threads on dandi/support/digests.py - outdated.)
Comment on lines 158 to 160
Member:

Suggested change:

```diff
-    Only check the first digest that is present in ``digests`` , rather than
-    checking all and returning if any are ``True`` - save compute time and
-    honor the notion of priority - some hash algos are better than others.
+    Only the first digest algorithm in ``priority`` that is present in
+    ``digests`` is checked.
```

Describing an approach we're not doing is confusing, and "why" reasons generally don't belong in docstrings.

@sneakers-the-rat (Contributor, Author):

I usually find "why" explanations helpful - e.g. why not have a function that just takes one hash rather than a dict of hashes, and if we are passing multiple hashes, why wouldn't we check all of them - but sure, I'll move that out of the docstring, no strong opinions. The test should protect against future memory loss.

@sneakers-the-rat (Contributor, Author):

updated the docstring to also describe params, which makes the previous description of what we're not doing obsolete anyway

(Further resolved review threads on dandi/support/digests.py, dandi/tests/test_download.py, and dandi/download.py - all outdated.)
@yarikoptic (Member):

> and so I am pretty worried about corrupted data through all that.

unrelated to the destiny of this PR, I would welcome you to try datalad or git/git-annex directly on the https://github.com/dandi/dandisets which should provide quite an up to date and tightly version/digests controlled access to DANDI data.

```python
@@ -578,26 +583,16 @@ def _download_file(
                and "sha256" in digests
            ):
                if key_parts[-1].partition(".")[0] == digests["sha256"]:
                    lgr.debug("already exists - matching digest in filename")
```
Member:

it is always "not" fun to guess what message is talking about, so better to use

Suggested change:

```diff
-lgr.debug("already exists - matching digest in filename")
+lgr.debug("%s already exists - matching digest in filename", path)
```

or alike here and elsewhere

@sneakers-the-rat (Contributor, Author):

Got it. Also noticed that there was a checksum yield message that would apply when we skip because it matches, and added that.

@sneakers-the-rat (Contributor, Author):

I think I made all the suggested changes; the only thing I added is also yielding {'checksum': 'ok'} in cases where we check the digest and skip the download. The typing test is still mysteriously failing, and the Windows tests also still fail, presumably because digests are being cached differently than on mac/linux - lmk what would be good on that :)

@yarikoptic (Member):

@jwodder any clue why mypy has now become unhappy? Or is it just a coincidence - something new that came up now?

```
typing: commands[0]> mypy dandi
dandi/support/digests.py:109: error: Module "zarr_checksum" does not explicitly
export attribute "ZarrChecksumTree"  [attr-defined]
        from zarr_checksum import ZarrChecksumTree
        ^
dandi/files/zarr.py:292: error: Module "zarr_checksum" does not explicitly
export attribute "ZarrChecksumTree"  [attr-defined]
            from zarr_checksum import ZarrChecksumTree
            ^
Found 2 errors in 2 files (checked 80 source files)
```
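For context, mypy emits `attr-defined` here under its implicit-reexport rules: a name imported into a package's `__init__.py` is not treated as exported unless re-exported explicitly. A typical fix looks like the sketch below - whether #1365 did exactly this, and whether `zarr_checksum.tree` is the defining module, are assumptions:

```python
# Import from the module that defines the class instead of the package root:
from zarr_checksum.tree import ZarrChecksumTree

# ...or, inside zarr_checksum/__init__.py, re-export explicitly:
# from zarr_checksum.tree import ZarrChecksumTree
# __all__ = ["ZarrChecksumTree"]
```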

re windows: needs to be troubleshot, I guess.

```python
            ):
                yield _skip_file("already exists")
            elif digests is not None and check_digests(path, digests):
                lgr.debug(f"{path!r} already exists - matching digest")
```
Member:

I thought, in light of e.g. dandi/dandi-archive#1750, that we fully avoided using f-strings in logging, but at least 2 did sneak in already:

```
❯ git grep 'lgr\.[^(]*(f'
dandi/download.py:                lgr.debug(f"{path!r} - same attributes: {same}.  Redownloading")
dandi/upload.py:        lgr.info(f"Found {len(dandi_files)} files to consider")
```

I am not that much of a purist, so wouldn't force a change here.
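For illustration, the practical difference between the two styles (the path value here is hypothetical): %-style arguments are only interpolated if the record is actually emitted, so a suppressed debug message costs nothing and cannot raise while formatting, whereas an f-string is built eagerly at the call site.

```python
import logging

lgr = logging.getLogger("dandi")
path = "sub-RAT123.nwb"  # hypothetical value

# Lazy: interpolation happens only when a DEBUG record is emitted.
lgr.debug("%r already exists - matching digest", path)

# Eager: the message string is built even when DEBUG is disabled.
lgr.debug(f"{path!r} already exists - matching digest")
```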

@sneakers-the-rat (Contributor, Author):

aha! I didn't realize there was a practical reason for this - I thought it was just old-style string formatting. Did this in bc28daa.

@jwodder (Member) commented Nov 27, 2023

@yarikoptic The typing errors are addressed by #1365.

@yarikoptic (Member):

@jwodder could you please analyze that Windows issue, for which @sneakers-the-rat also reported con/fscacher#87?

@sneakers-the-rat (Contributor, Author):

> The typing errors are addressed by #1365.

Merged from master

> could you please analyze that windows issue, for which @sneakers-the-rat also reported con/fscacher#87 ?

lmk how I can help - I can force the cache to be invalidated in the tests to make them pass, but it seems worth addressing the underlying source of the inconsistency.

@jwodder (Member) left a comment:

Review so far.

```python
    download(url, tmp_path, get_metadata=False)
    dsdir = tmp_path / "000027"
    nwb = dsdir / "sub-RAT123" / "sub-RAT123.nwb"
    digests = digester(str(nwb))
```

Suggested change:

```diff
-digests = digester(str(nwb))
+digests = digester(nwb)
```

```python
    os.utime(nwb, (time.time(), mtime))

    # now original digests should fail since it's a bunch of zeros
    zero_digests = digester(str(nwb))
```

Suggested change:

```diff
-zero_digests = digester(str(nwb))
+zero_digests = digester(nwb)
```

assert "successfully downloaded" in last_5.lower()

# should pass again
redownload_digests = digester(str(nwb))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
redownload_digests = digester(str(nwb))
redownload_digests = digester(nwb)

Comment on lines +199 to +200

```python
    for dtype in redownload_digests.keys():
        assert redownload_digests[dtype] == digests[dtype]
```

Suggested change:

```diff
-for dtype in redownload_digests.keys():
-    assert redownload_digests[dtype] == digests[dtype]
+assert redownload_digests == digests
```

```python
                    lgr.warning(
                        f"{path!r} - no mtime or ctime in the record, redownloading"
                    )
                if digests is not None and check_digests(path, digests):
```

Member:

This is a change in behavior. As documented here and here, "refresh" is only supposed to check size and mtime.

@jwodder (Member) commented Nov 28, 2023

@yarikoptic @sneakers-the-rat As I indicated here, making the "refresh" mode compare digests is a change in behavior. It's currently documented to only check size & mtime, and if we were to make it check digests as well, what would be the difference between "refresh" and "overwrite-different"?

@sneakers-the-rat (Contributor, Author):

Yes! I am proposing a change in behavior - I wasn't sure what the difference was, so plz lmk if I'm missing some context here. It seems like the intention of both of them is "check if we already have the file we're about to download; if so, skip."

The current OVERWRITE_DIFFERENT implementation (a) does a special-cased check for a sha256 hash in the filename if it is in git-annex, and then (b) checks for dandi-etags or md5 hashes if they're present; one thing this PR does is consolidate (b), so that e.g. it's easier to change digest types if we want to do IPFS things with CIDs and whatnot.

REFRESH seems to be a faster but weaker version of the same intention. So ideally we would consolidate them, but I wasn't sure how y'all felt about that and didn't want to do too much in the same PR. The naming suggests that REFRESH was intended to synchronize an already-downloaded dataset when the version on the server is updated, but that's also covered by a more general "download if different" strategy that first checks for mismatches in mtime or size and only then checks the hash. That might be slow, but since y'all cache hash digests on disk, repeated calls to refresh/overwrite_different should be fast if mtime/size haven't changed. The difference before and after this PR would then be excluding the case where size and mtime match but the data doesn't.

So maybe we need a change in semantics? We could collapse those down into one download mode. If the difference between the two modes at the moment is checking for sameness via time-and-size vs. hash, then I think the names could be made a little clearer. Since refresh can also overwrite, maybe something like

download --existing={error,skip,refresh,force}

which could also move downloading closer to uploading semantics:

e.g. currently:

| op | download | upload |
| --- | --- | --- |
| error | raise error | raise error |
| skip | skip | skip |
| force | | always overwrite |
| overwrite | always overwrite | overwrite if size or mtime differs |
| overwrite-different | overwrite if size or hash differs | |
| refresh | overwrite if size or mtime differs | overwrite if local mtime > remote |

we could do

| op | download | upload |
| --- | --- | --- |
| error | (same) | (same) |
| skip | (same) | (same) |
| force | always overwrite | always overwrite |
| refresh | overwrite if size, mtime, or hash differs | overwrite if size or hash differs and local mtime > remote mtime |

which isn't perfectly symmetric, but uploading and downloading aren't symmetric actions either in this case.

a humble proposal!

@yarikoptic (Member):

Hi @sneakers-the-rat, and sorry for the delay in providing more feedback here. Indeed there seems to be some need/desire to streamline the modes more! The original desire behind refresh is to be able to do a "quick refresh", so collapsing it with "also check a checksum" would ruin that. But I also see the desired specificity of overwrite-different, as it:

  • does not rely on/test mtime at all, operating solely on content (size and then checksum)
  • does not lead to a "refresh" of e.g. the file's mtime, so a subsequent refresh can still be used for a quick up-to-dateness assurance.

I guess we could gain refresh-thorough (or replace the current refresh with refresh-fast and introduce a new refresh) which would:

  • still use size and mtime as a shortcut, triggering a download if either of them differs
  • if those are identical - go to checksums

It would be different from overwrite-different in having the mtime-based shortcut, but that would be the only difference if I see it right. The question would be - is it worth the effort in addition to overwrite-different? And should we complement its functioning by adjusting the mtime (if we can) to match the remote one?

@sneakers-the-rat (Contributor, Author) commented Jan 29, 2024

Gotcha, makes sense.

So one option is to make a separate action for update or sync rather than overloading download - it seems like the intent of "ensure what I have is the same as on the server" is different enough from "get this item from the server" that it deserves its own command; maybe that fits in with the current digest function?

The basic tradeoff is certainty vs. speed, right? The purpose of checksumming is to give an option where someone can be certain that what they have locally is the same as what's on the server, without needing to do the full download - i.e. there isn't a case where a false negative for triggering an update would be desirable, except for the time it costs.
So, in order of costliness:

  • Check mtime/size
  • Checksum
  • Download the file

I am still not quite sure why overwrite-different = (size + checksum) and refresh = (size + mtime) would be two different paths. As is, I think this is the current flowchart:

```mermaid
flowchart TD
    Download
    DoDownload[Do Download]
    DLMode[Download Mode]
    DontDownload[Don't Download]
    CheckSize[Check Size]
    CheckMTime[Check mtime]
    Checksum[Checksum]

    Download -- "Doesn't Exist" --> DoDownload
    Download -- "Does Exist" --> DLMode
    DLMode -- "`ignore`" --> DontDownload
    DLMode -- "overwrite" --> DoDownload
    DLMode -- "overwrite-different" --> CheckSize
    DLMode -- "refresh" --> CheckSize
    CheckSize -- "match (overwrite-different)" --> Checksum
    CheckSize -- "match (refresh)" --> CheckMTime
    CheckSize -- "no match" --> DoDownload

    CheckMTime -- "match" --> DontDownload
    CheckMTime -- "no match" --> DoDownload

    Checksum -- "match" --> DontDownload
    Checksum -- "no match" --> DoDownload
```

Which is to say: I don't think refresh and overwrite-different communicate the difference between checking mtime + size vs. checksum + size - what is the semantic difference between those paths? It's sort of hard for me to articulate.

What would make sense to me would be to have a "fast" or "slow" path that goes through the same set of checks:

So e.g.

  • dandi download --force
  • dandi download --ignore
  • dandi download --refresh
  • dandi download --refresh --checksum
```mermaid
flowchart TD
    Download
    DoDownload[Do Download]
    DLMode[Download Mode]
    DontDownload[Don't Download]
    CheckSize[Check Size]
    CheckMTime[Check mtime]
    Checksum[Checksum]
    ChecksumEnabled[--checksum]

    Download -- "Doesn't Exist" --> DoDownload
    Download -- "Does Exist" --> DLMode
    DLMode -- "--ignore" --> DontDownload
    DLMode -- "--force" --> DoDownload
    DLMode -- "--refresh" --> CheckSize
    CheckSize -- "match" --> CheckMTime
    CheckSize -- "no match" --> DoDownload

    CheckMTime -- "match" --> ChecksumEnabled
    CheckMTime -- "no match" --> DoDownload

    ChecksumEnabled -- "yes" --> Checksum
    ChecksumEnabled -- "no" --> DontDownload

    Checksum -- "match" --> DontDownload
    Checksum -- "no match" --> DoDownload
```

So then the action refresh is the same, but the checksum flag determines whether one does the refresh in a "fast" or "slow" way, where the tradeoff is clearly speed vs. certainty. One could also set some env or config variable like "always checksum when refreshing", which would be awkward to express if that option were part of the same enum as the download mode.

I think that mtime is a pretty weak check by itself, with lots of opportunities for false positives and negatives, since it isn't an independent representation of the version/time of update on the server vs. someone touching or modifying the file - so I take no opinion on whether the download function should match the mtimes. To me the question is "how can we make the download command reflect the intention of 'download this and make sure what I have is correct'". I hope I'm not coming across as too pedantic or ornery here - happy to do whatever with this PR.

@yarikoptic (Member):

> I am still not quite sure why overwrite-different = (size + checksum) and refresh = (size + mtime) would be two different paths

If size differs, the checksum cannot[*] be identical. If mtime differs, the checksum can still be the same.

Once again - the size check in the "checksum"-based mode is just a shortcut, for the aforementioned reason.

[*] unless there is a weak checksum like md5 and we are under attack... not the case here

@sneakers-the-rat (Contributor, Author):

Right, right - understood: size and mtime are "fast" ways to check for mismatches, and the checksum is a slower but more accurate check. What I mean when I say I'm not sure why those should be two separate code paths is that overwrite-different and refresh seem to be doing two versions of the same thing, a "fast" and a "slow" version - we know that if the size is different the checksum will fail, but even when they match we might not want to do the checksum if we just want a "fast" check (as per the purpose of refresh).

So why --overwrite-different and --refresh, rather than --refresh and --refresh --checksum, to indicate "fast with potential for false negatives" and "slow but certain" versions of the same intention?

Don't want to get too far into the weeds on this; it seems like in any case this PR is not of interest, which is fine, but I'm wondering what the outcome should be:

  • Close this and pretend it never happened (totally fine)
  • Refactor to keep just the generalization of checking digests (rather than if/else-ing each digest type) in the OVERWRITE_DIFFERENT leg. If we pick this one, I'd also want to contribute to the docs a bit to explain what the different options mean (currently the docs only explain refresh)

In either case I assume we would want to handle any renaming of options etc. in another PR, which I would also be happy to draft.

@sneakers-the-rat
Copy link
Contributor Author

all good, i'm just gonna make my own wrapper <3. thx for yr time on this one, closing to tidy up my lists.
