Check digests when redownloading #1364
Conversation
- redownload immediately if size or mtime don't match, check digest only if they do.
- debug messages on all arms of the decision to download
- move import into download function
- test using debug messages
Codecov Report

Attention:

Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master    #1364      +/-   ##
    ==========================================
    - Coverage   88.69%   88.65%   -0.04%
    ==========================================
      Files          77       77
      Lines       10434    10503      +69
    ==========================================
    + Hits         9254     9311      +57
    - Misses       1180     1192      +12

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Interesting. Looks like the windows tests are failing - it's saying that a file of all zeros has the same hash as the real file. I'm assuming that's because of caching - that actually seems like it's coming from the way ...
dandi/support/digests.py
Outdated
    Only check the first digest that is present in ``digests``, rather than
    checking all and returning if any are ``True`` - save compute time and
    honor the notion of priority - some hash algos are better than others.
Suggested change:

    -Only check the first digest that is present in ``digests``, rather than
    -checking all and returning if any are ``True`` - save compute time and
    -honor the notion of priority - some hash algos are better than others.
    +Only the first digest algorithm in ``priority`` that is present in
    +``digests`` is checked.
Describing an approach we're not doing is confusing, and "why" reasons generally don't belong in docstrings.
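For illustration (not the PR's actual code), the priority-based check being discussed would look roughly like this, assuming a hashlib-backed helper and a made-up `DIGEST_PRIORITY` ordering:

```python
import hashlib
from pathlib import Path
from typing import Dict

# Illustrative priority order, strongest first; the actual ordering in the PR may differ.
DIGEST_PRIORITY = ("sha512", "sha256", "sha1", "md5")


def check_digests(path: Path, digests: Dict[str, str]) -> bool:
    """Compare the file at ``path`` against only the first (highest-priority)
    algorithm from DIGEST_PRIORITY that appears in ``digests``."""
    for algo in DIGEST_PRIORITY:
        if algo in digests:
            h = hashlib.new(algo)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest() == digests[algo]
    raise ValueError("no known digest algorithm found in digests")
```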
I usually find "why" explanations helpful - eg. why not have a function that just takes one hash rather than a dict of hashes, and if we are passing multiple hashes why wouldn't we check all of them - but sure i'll move that out of the docstring, no strong opinions. the test should protect against future memory loss
updated docstring to also describe params and that makes previous description of what we're not doing obsolete anyway
unrelated to the destiny of this PR, I would welcome you to try datalad or git/git-annex directly on https://github.com/dandi/dandisets which should provide quite an up to date and tightly version/digests controlled access to DANDI data.
dandi/download.py
Outdated
    @@ -578,26 +583,16 @@ def _download_file(
            and "sha256" in digests
        ):
            if key_parts[-1].partition(".")[0] == digests["sha256"]:
                lgr.debug("already exists - matching digest in filename")
it is always "not" fun to guess what message is talking about, so better use
Suggested change:

    -lgr.debug("already exists - matching digest in filename")
    +lgr.debug("%s already exists - matching digest in filename", path)
or alike here and elsewhere
Got it. Also noticed that there was a checksum yield message that would apply when we skip because it matches, and added that.

I think i made all the suggested changes, only thing that i added is also yielding ...
@jwodder any clue why mypy now becomes unhappy? or is it just a coincidence and something new which came up now?

re windows: needs to be troubleshot I guess.
dandi/download.py
Outdated
    ):
        yield _skip_file("already exists")
    elif digests is not None and check_digests(path, digests):
        lgr.debug(f"{path!r} already exists - matching digest")
I thought, in the light of e.g. dandi/dandi-archive#1750, that we fully avoided using f-strings in logging but at least 2 did sneak in already
    ❯ git grep 'lgr\.[^(]*(f'
    dandi/download.py: lgr.debug(f"{path!r} - same attributes: {same}. Redownloading")
    dandi/upload.py: lgr.info(f"Found {len(dandi_files)} files to consider")
I am not that much of a puritan so wouldn't force a change here.
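For context (my illustration, not from the dandi codebase): with %-style arguments the logging module defers interpolation until a handler actually emits the record and keeps the constant template in `record.msg`, whereas an f-string is rendered up front regardless of level, which is presumably the practical concern behind avoiding them:

```python
import logging

lgr = logging.getLogger("demo")
logging.basicConfig(level=logging.INFO)

path = "/data/sub-RAT123/sub-RAT123.nwb"  # made-up example path

# f-string: interpolated immediately, even though DEBUG is disabled here,
# and handlers only ever see the already-rendered message.
lgr.debug(f"{path!r} already exists - matching digest")

# %-style args: interpolation is deferred until the record is actually emitted,
# and record.msg stays the constant template ("%r already exists - ...").
lgr.debug("%r already exists - matching digest", path)
```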
aha! i didn't realize there was a practical reason for this, thought it was just old style string formatting. did this in bc28daa
@yarikoptic The typing errors are addressed by #1365.

@jwodder could you please analyze that windows issue, for which @sneakers-the-rat also reported con/fscacher#87?
Merged from
lmk how i can help - i can force the cache to be invalidated in the tests to make them pass, but it seems like it's worth addressing the underlying source of inconsistency
Review so far.
    download(url, tmp_path, get_metadata=False)
    dsdir = tmp_path / "000027"
    nwb = dsdir / "sub-RAT123" / "sub-RAT123.nwb"
    digests = digester(str(nwb))
Suggested change:

    -digests = digester(str(nwb))
    +digests = digester(nwb)
    os.utime(nwb, (time.time(), mtime))

    # now original digests should fail since it's a bunch of zeros
    zero_digests = digester(str(nwb))
Suggested change:

    -zero_digests = digester(str(nwb))
    +zero_digests = digester(nwb)
    assert "successfully downloaded" in last_5.lower()

    # should pass again
    redownload_digests = digester(str(nwb))
Suggested change:

    -redownload_digests = digester(str(nwb))
    +redownload_digests = digester(nwb)
    for dtype in redownload_digests.keys():
        assert redownload_digests[dtype] == digests[dtype]
Suggested change:

    -for dtype in redownload_digests.keys():
    -    assert redownload_digests[dtype] == digests[dtype]
    +assert redownload_digests == digests
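Putting the quoted test fragments together, the core trick is to overwrite the downloaded file with zeros of the same size and restore its mtime, so that only a digest check can detect the corruption. A self-contained illustration of that trick (names and paths here are made up, this is not the literal test code):

```python
import hashlib
import os
import tempfile


def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    nwb = os.path.join(tmp, "sub-RAT123.nwb")  # stand-in for the downloaded file
    with open(nwb, "wb") as f:
        f.write(b"original contents")
    good_digest = sha256(nwb)
    st = os.stat(nwb)

    # corrupt the file: same size, same mtime, different bytes
    with open(nwb, "wb") as f:
        f.write(b"\0" * st.st_size)
    os.utime(nwb, (st.st_atime, st.st_mtime))

    assert os.stat(nwb).st_size == st.st_size  # a size check would still pass
    assert sha256(nwb) != good_digest          # only a digest check notices
```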
    lgr.warning(
        f"{path!r} - no mtime or ctime in the record, redownloading"
    )
    if digests is not None and check_digests(path, digests):
@yarikoptic @sneakers-the-rat As I indicated here, making the "refresh" mode compare digests is a change in behavior. It's currently documented to only check size & mtime, and if we were to make it check digests as well, what would be the difference between "refresh" and "overwrite-different"?
Yes! i am proposing a change in behavior - I wasn't sure what the difference was, so plz lmk if i'm missing some context here. It seems like the intention of both of them is "check if we have the file we're about to download, if so skip." The current ...

So maybe we need a change in semantics? We could collapse those down into one download mode. If the difference between the two modes at the moment is check for sameness with time and size vs. hash, then i think the names could be made a little clearer. since ... which could also move downloading closer to uploading semantics: eg currently: ... we could do ... which isn't perfectly symmetric, but uploading and downloading aren't symmetric actions either in this case. a humble proposal!
Hi @sneakers-the-rat and sorry for the delay on providing more feedback here. Indeed there seems to be some need/desire to streamline modes more! The original desire behind ...

I guess we could gain ...

It would be different from ...
Gotcha, makes sense. So one option is to make a separate action for ... The basic tradeoff is certainty vs. speed, right? The purpose of checksumming is to give an option where someone can be certain that what they have locally is the same thing as what's on the server without needing to do the full download. ie. there isn't a case where it would be desirable to have a false negative for triggering some update, except for time.
I am still not quite sure why ...

    flowchart TD
        Download
        DoDownload[Do Download]
        DLMode[Download Mode]
        DontDownload[Don't Download]
        CheckSize[Check Size]
        CheckMTime[Check mtime]
        Checksum[Checksum]
        Download -- "Doesnt Exist" --> DoDownload
        Download -- "Does Exist" --> DLMode
        DLMode -- "`ignore`" --> DontDownload
        DLMode -- "overwrite" --> DoDownload
        DLMode -- "overwrite-different" --> CheckSize
        DLMode -- "refresh" --> CheckSize
        CheckSize -- "match (overwrite-different)" --> Checksum
        CheckSize -- "match (refresh)" --> CheckMTime
        CheckSize -- "no match" --> DoDownload
        CheckMTime -- "match" --> DontDownload
        CheckMTime -- "no match" --> DoDownload
        Checksum -- "match" --> DontDownload
        Checksum -- "no match" --> DoDownload
Where I don't think that ... What would make sense to me would be to have a "fast" or "slow" path that goes through the same set of checks. So eg.
    flowchart TD
        Download
        DoDownload[Do Download]
        DLMode[Download Mode]
        DontDownload[Don't Download]
        CheckSize[Check Size]
        CheckMTime[Check mtime]
        Checksum[Checksum]
        ChecksumEnabled[--checksum]
        Download -- "Doesnt Exist" --> DoDownload
        Download -- "Does Exist" --> DLMode
        DLMode -- "--ignore" --> DontDownload
        DLMode -- "--force" --> DoDownload
        DLMode -- "--refresh" --> CheckSize
        CheckSize -- "match" --> CheckMTime
        CheckSize -- "no match" --> DoDownload
        CheckMTime -- "match" --> ChecksumEnabled
        CheckMTime -- "no match" --> DoDownload
        ChecksumEnabled -- "yes" --> Checksum
        ChecksumEnabled -- "no" --> DontDownload
        Checksum -- "match" --> DontDownload
        Checksum -- "no match" --> DoDownload
So then the action ... I think that ...
if size differs -- checksum cannot[*] be identical. if ... once again -- ...

[*] unless there is a weak checksum like md5 and we are under attack... not the case here
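In code terms (my paraphrase of the point above, not dandi-cli's actual helper): a size comparison is a free negative check that can short-circuit before any hashing is done:

```python
import hashlib
from pathlib import Path


def matches(path: Path, expected_size: int, expected_sha256: str) -> bool:
    """If the size differs, the content (and thus any strong digest) cannot
    match, so skip the expensive hashing entirely."""
    if path.stat().st_size != expected_size:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```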
Right right, understand that, so the ... So why ... Don't want to get too far into the weeds on this - it seems like in any case this PR is not of interest, which is fine, but wondering what the outcome should be:
It seems like in either case I assume we would probably want to handle changes to call naming/etc. in another PR, which I would also be happy to draft.
all good, i'm just gonna make my own wrapper <3. thx for yr time on this one, closing to tidy up my lists.
Hello again (again!)
Hoping this unsolicited PR isn't unwelcome - figured it would be lightweight enough that it didn't need to be pre-checked.
This PR checks file digests when deciding to re-download a file or not. Basically it implements this TODO:
In downloading datasets I had one disk go down, and IT keeps messing with my network so i get intermittent connection breaks that cause downloads to stall, and so I am pretty worried about corrupted data through all that. It makes sense to me to check file hashes at the time of deciding to redownload (so semantically, attempting to download a file to the same location always results in the same file for a given dandiset and version). Thinking about the distributed future of dandi, we'll need to be doing a whole lot of hashing - so eg. this is a step towards being able to validate data from a mirror server against the hashes hosted on the 'main' server.
Pretty straightforward:

- `check_digests` - i noticed that in the relevant places there were several possible digests that could be checked against, so to tidy up a few `if/else`s i just put that into one function that checks hashes in order of priority. I do this rather than ensuring that all hashes match or just testing the first one because not all hashing functions are created equal (eg. md5 has been broken for >10 years and should be deprecated), so we do want to explicitly prefer some, esp. since in the future more will break.
- `OVERWRITE_DIFFERENT` leg is unchanged - just refactored using `check_digests`
- `REFRESH` (a condensed sketch of this branching follows after this description):
  - if `digests` is `None` or an empty dict, behavior is unchanged.
  - if `mtime` is missing and we have a digest, check it. don't redownload if the digest matches.
  - if we have `mtime`, first check it and `size`. If either doesn't match, redownload without checking the digest.
  - if `mtime` and `size` match, and we have a digest, then check it. if it matches, don't redownload.
- tests for the `check_digests` function and to ensure that files that match size and mtime but don't match hash trigger a re-download.

Tests pass, ran precommit actions. Not sure why mypy is failing on the `ZarrChecksumTree` import, i didn't touch that code. I see the code coverage checker complaining; i didn't see tests for the `_download_file` function itself in the other tests, which i'd need to do to check against a dandiset without a digest, but i can definitely do that if needed.

Potential changes

- `_download_file` is now quite long, and it seems like a separate enough operation that it should be split out into its own function, but didn't want to do too much in one PR without checking.
- the `digest` command? The current `dandi digest` command just computes the hash, but we could also have it check it against the server hashes as well?

probably more! lmk what needs to be changed :).
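To make the `REFRESH` branching above concrete, here is a condensed sketch (paraphrased; the real `_download_file` in this PR also emits debug messages and yields skip/progress updates, and the exact comparisons may differ):

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def refresh_decision(
    path: Path,
    size: int,
    mtime: Optional[float],
    digests: Optional[Dict[str, str]],
    check_digests: Callable[[Path, Dict[str, str]], bool],
) -> str:
    """Return "download" or "skip" for a file that already exists locally."""
    if not digests:
        digests = None  # None or empty dict: fall back to the pre-existing behavior
    st = path.stat()
    if mtime is None:
        # no recorded mtime: a matching digest is the only way to justify skipping
        if digests is not None and check_digests(path, digests):
            return "skip"
        return "download"
    if st.st_size != size or st.st_mtime != mtime:
        return "download"  # cheap mismatch: redownload without checking the digest
    if digests is not None and not check_digests(path, digests):
        return "download"  # size and mtime match but the content does not
    return "skip"
```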
Side note: I followed the `DEVELOPMENT.md` guide to run local tests, but the vendored docker-compose files didn't work, and the fixtures aren't set up to use the dandi-archive docker container i had set up outside of the `dandi-cli` repo. I had to add an extra `DANDI_DOCKER_DIR` env variable to replace the `LOCAL_DOCKER_DIR`, which is hardcoded to a directory beneath `dandi/tests/fixtures`. Pretty sure i followed instructions, and it looked like some parts of that doc have gotten out of sync with the fixtures. I also had to override the `versioneer` auto-versioning, which caused ~half of the tests to fail since my version was something like `dandi-0+untagged.3399.g20efd37`, since tags don't get fetched for forks by default.
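For concreteness, the workaround was roughly of this shape (a hypothetical sketch of a fixture-side override; `DANDI_DOCKER_DIR` is the variable i added in my fork, and the fallback path here is only illustrative of the hardcoded `LOCAL_DOCKER_DIR` under `dandi/tests/fixtures`):

```python
import os
from pathlib import Path

# hypothetical sketch: let an env variable point the test fixtures at an
# externally managed dandi-archive docker-compose directory instead of the
# vendored one beneath dandi/tests/fixtures
LOCAL_DOCKER_DIR = Path(__file__).parent / "fixtures" / "dandiarchive-docker"
DOCKER_DIR = Path(os.environ.get("DANDI_DOCKER_DIR", str(LOCAL_DOCKER_DIR)))
```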