do not fscache individual files digests for zarr-checksum #914

Closed
yarikoptic opened this issue Feb 16, 2022 · 0 comments · Fixed by #923
A follow-up to #913, which has more timing information. For the timings in that issue I disabled fscacher (after first physically moving aside the entire fscacher dandi-checksums cache), and the run took about 23 seconds. When I stopped disabling the cache, the run took over 5 minutes (so fscacher made the run roughly 15× slower, ~1500% of the original time, if I got it right):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:02:54,861 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185712Z-5383.log

real    5m43.311s
user    0m24.188s
sys     0m13.128s

and rerunning took about 4 sec (better than the original ~23 sec, but still slower than a plain full recompute could be, see #913):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:05:03,854 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216190459Z-5515.log

real    0m4.576s
user    0m2.878s
sys     0m4.593s

and that is using the fscacher from con/fscacher#67 ... maybe that one really needs to become more efficient.

FWIW, with this patch I disabled caching of individual file digests but added caching for the zarr folder:
(dandi-devel) jovyan@jupyter-yarikoptic:~/dandi-cli$ git diff
diff --git a/dandi/support/digests.py b/dandi/support/digests.py
index 2226ea8..74e5199 100644
--- a/dandi/support/digests.py
+++ b/dandi/support/digests.py
@@ -81,7 +81,7 @@ class Digester:
 checksums = PersistentCache(name="dandi-checksums", envvar="DANDI_CACHE")
 
 
-@checksums.memoize_path
+#@checksums.memoize_path
 def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
     if digest == "dandi-etag":
         return cast(str, get_dandietag(filepath).as_str())
@@ -96,6 +96,7 @@ def get_dandietag(filepath: Union[str, Path]) -> DandiETag:
     return DandiETag.from_file(filepath)
 
 
+@checksums.memoize_path
 def get_zarr_checksum(
     path: Path,
     basepath: Optional[Path] = None,
and it ran "fast" in the original ~22 sec, and reloaded the result from fscacher in 3-4 sec:
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ mv  ~/.cache/fscacher/dandi-checksums{,.aside2}
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:22:10,832 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192149Z-6188.log

real    0m22.338s
user    0m8.445s
sys     0m6.815s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:23:02,610 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192259Z-8240.log

real    0m3.401s
user    0m2.926s
sys     0m5.002s

Meanwhile, I think it would be worth just disabling fscaching of individual file digests within the zarr archive (the overhead of storing that many digests on the initial run seems too great to ignore) and of the zarr folder altogether (until we make fscaching of folders more efficient).
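
Concretely, the proposal would amount to leaving both digest helpers undecorated. The sketch below is only an illustration of that caching policy (the folder checksum shown is a placeholder combination of per-file digests, not dandi's actual zarr-checksum algorithm), just to show where the per-file and per-folder memoization would be dropped:

import hashlib
from pathlib import Path
from typing import Union

def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
    # plain recompute: without @checksums.memoize_path there is no per-file
    # cache entry to fingerprint and store for every small chunk in the zarr tree
    h = hashlib.new(digest)
    with open(filepath, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def get_zarr_checksum(path: Path) -> str:
    # also left uncached until fscacher can fingerprint a directory tree cheaply.
    # NOTE: md5-over-sorted-per-file-digests here is only a stand-in, not
    # dandi's actual zarr checksum format.
    per_file = sorted(
        f"{p.relative_to(path)}:{get_digest(p, 'md5')}"
        for p in path.rglob("*")
        if p.is_file()
    )
    return hashlib.md5("\n".join(per_file).encode()).hexdigest()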

I am not sure whether, alternatively, we should/could come up with a smarter policy/specification for what to cache, e.g. parametrize memoize_path to cache only if a file is larger than some specified size (e.g. 500 KB or 1 MB in the case of digests). That would still need an os.stat call to make the decision, so there might still be some overhead, but it would make the caching more flexible/generic.
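
If we went that route, the knob could live either in fscacher itself or in a thin wrapper on our side. A rough sketch of the wrapper variant follows; memoize_path_if_large and its min_size default are made up here for illustration, fscacher's memoize_path has no such parameter today:

import os
from functools import wraps
from pathlib import Path
from typing import Union

def memoize_path_if_large(cache, min_size: int = 1 << 20):  # 1 MB threshold, hypothetical default
    """Only route calls through the persistent cache for files above min_size;
    small files are cheaper to re-digest than to fingerprint and store."""
    def decorator(func):
        memoized = cache.memoize_path(func)  # fscacher's existing decorator

        @wraps(func)
        def wrapper(path: Union[str, Path], *args, **kwargs):
            # one extra stat() per call is the cost of making the decision
            if os.stat(path).st_size >= min_size:
                return memoized(path, *args, **kwargs)
            return func(path, *args, **kwargs)

        return wrapper
    return decorator

# usage could then look something like:
# @memoize_path_if_large(checksums, min_size=500 * 1024)
# def get_digest(filepath, digest="sha256"): ...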
