do not fscache individual files digests for zarr-checksum #914

Closed
yarikoptic opened this issue Feb 16, 2022 · 0 comments · Fixed by #923
A follow-up to #913, which has more timing information. For the timings in that issue I disabled fscacher (after first physically moving aside the entire fscacher dandi-checksums cache), and the run took about 23 seconds. When I stopped disabling the cache, the run took over 5 minutes (so fscacher made the run roughly 15× slower, ~1500% of the original time, if I got it right):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:02:54,861 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216185712Z-5383.log

real    5m43.311s
user    0m24.188s
sys     0m13.128s

and rerunning took about 4 sec (better than the original ~23 sec, but still slower than a plain full recompute could be, see #913):

(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:05:03,854 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216190459Z-5515.log

real    0m4.576s
user    0m2.878s
sys     0m4.593s

and that is using the fscacher from con/fscacher#67 ... maybe that one really needs to become more efficient.

FWIW, with this patch I disabled caching of individual file digests but added caching for the zarr folder:
(dandi-devel) jovyan@jupyter-yarikoptic:~/dandi-cli$ git diff
diff --git a/dandi/support/digests.py b/dandi/support/digests.py
index 2226ea8..74e5199 100644
--- a/dandi/support/digests.py
+++ b/dandi/support/digests.py
@@ -81,7 +81,7 @@ class Digester:
 checksums = PersistentCache(name="dandi-checksums", envvar="DANDI_CACHE")
 
 
-@checksums.memoize_path
+#@checksums.memoize_path
 def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
     if digest == "dandi-etag":
         return cast(str, get_dandietag(filepath).as_str())
@@ -96,6 +96,7 @@ def get_dandietag(filepath: Union[str, Path]) -> DandiETag:
     return DandiETag.from_file(filepath)
 
 
+@checksums.memoize_path
 def get_zarr_checksum(
     path: Path,
     basepath: Optional[Path] = None,
and it ran "fast" in the original ~22 sec, and reloaded the result from fscacher in 3-4 sec:
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ mv  ~/.cache/fscacher/dandi-checksums{,.aside2}
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:22:10,832 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192149Z-6188.log

real    0m22.338s
user    0m8.445s
sys     0m6.815s
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/0
test64.ngff/0/0/0/0: 53ad9d9fe8b4f34882fdaea76599da22
2022-02-16 19:23:02,610 [    INFO] Logs saved in /home/jovyan/.cache/dandi-cli/log/20220216192259Z-8240.log

real    0m3.401s
user    0m2.926s
sys     0m5.002s

Meanwhile, I think it would be worth just disabling fscaching of individual file digests within the zarr archive (the overhead of storing that many digests on the initial run seems too great to ignore) and of the zarr folder altogether (until we make fscaching of folders more efficient).
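
Concretely, the proposal would amount to leaving both digest helpers undecorated. The sketch below is only an illustration of that caching policy (the folder checksum shown is a placeholder combination of per-file digests, not dandi's actual zarr-checksum algorithm), just to show where the per-file and per-folder memoization would be dropped:

import hashlib
from pathlib import Path
from typing import Union

def get_digest(filepath: Union[str, Path], digest: str = "sha256") -> str:
    # plain recompute: without @checksums.memoize_path there is no per-file
    # cache entry to fingerprint and store for every small chunk in the zarr tree
    h = hashlib.new(digest)
    with open(filepath, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def get_zarr_checksum(path: Path) -> str:
    # also left uncached until fscacher can fingerprint a directory tree cheaply.
    # NOTE: md5-over-sorted-per-file-digests here is only a stand-in, not
    # dandi's actual zarr checksum format.
    per_file = sorted(
        f"{p.relative_to(path)}:{get_digest(p, 'md5')}"
        for p in path.rglob("*")
        if p.is_file()
    )
    return hashlib.md5("\n".join(per_file).encode()).hexdigest()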

I am not sure whether, alternatively, we should/could come up with a smarter policy/specification for what to cache, e.g. parametrize memoize_path to cache only if a file is larger than some specified size (e.g. 500 KB or 1 MB in the case of digests). That would still need an os.stat call to make the decision, so there might still be some overhead, but it would make the caching more flexible/generic.
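
If we went that route, the knob could live either in fscacher itself or in a thin wrapper on our side. A rough sketch of the wrapper variant follows; memoize_path_if_large and its min_size default are made up here for illustration, fscacher's memoize_path has no such parameter today:

import os
from functools import wraps
from pathlib import Path
from typing import Union

def memoize_path_if_large(cache, min_size: int = 1 << 20):  # 1 MB threshold, hypothetical default
    """Only route calls through the persistent cache for files above min_size;
    small files are cheaper to re-digest than to fingerprint and store."""
    def decorator(func):
        memoized = cache.memoize_path(func)  # fscacher's existing decorator

        @wraps(func)
        def wrapper(path: Union[str, Path], *args, **kwargs):
            # one extra stat() per call is the cost of making the decision
            if os.stat(path).st_size >= min_size:
                return memoized(path, *args, **kwargs)
            return func(path, *args, **kwargs)

        return wrapper
    return decorator

# usage could then look something like:
# @memoize_path_if_large(checksums, min_size=500 * 1024)
# def get_digest(filepath, digest="sha256"): ...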
