
Cache directory fingerprint as a XORed hash of file fingerprints #71

Merged: 3 commits merged into master on Feb 22, 2022

Conversation

@jwodder (Member) commented Feb 18, 2022

Closes #68.
Closes #19.

Alternative to #70.
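
For context, a minimal hypothetical sketch of the idea in the PR title (illustrative only, not the actual fscacher implementation): hash each file's fingerprint separately and XOR the digests together, which gives a directory fingerprint that does not depend on the order in which files are visited.

    from hashlib import md5

    def directory_fingerprint(file_fingerprints):
        # file_fingerprints: an iterable of per-file fingerprint byte strings
        # (hypothetical input; fscacher's actual fingerprint format may differ)
        result = bytes(16)  # 16 zero bytes, the size of an MD5 digest
        for fprint in file_fingerprints:
            digest = md5(fprint).digest()
            # XOR is commutative and associative, so file order does not matter
            result = bytes(a ^ b for a, b in zip(result, digest))
        return result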

@jwodder added the "performance" label (Improve performance of an existing feature) on Feb 18, 2022
@yarikoptic (Member)

Also would close #19 ?

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic Yes.


    def add_file(self, path, fprint: FileFingerprint):
        self.tree_fprints[path] = fprint
        if self.last_modified is None or self.last_modified < fprint.mtime_ns:
            ...
        fprint_hash = list(
Member

Why is it a list?

@jwodder (Member, Author) Feb 18, 2022

md5().digest() returns bytes, and we need to convert it to a list to get a sequence of ints, which can be XORed.
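
For illustration, a small hedged example of the conversion described above (hypothetical inputs, not the exact code from the diff):

    from hashlib import md5

    # md5(...).digest() is a 16-byte bytes object; list() gives a list of ints
    d1 = list(md5(b"fingerprint-one").digest())
    d2 = list(md5(b"fingerprint-two").digest())
    # element-wise XOR of the two digests
    xored = [a ^ b for a, b in zip(d1, d2)]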

Member

Indeed. Somewhat odd that there is no builtin XOR over bytes.

I have benchmarked some "implementations" I found around and saw no magical winner to help us out (I was expecting them to perform faster, BTW -- timings are in µs, not the ns I was hoping for :-/):
In [6]: def byte_xor(ba1, ba2):
   ...:     return bytes([_a ^ _b for _a, _b in zip(ba1, ba2)])
   ...:     

In [7]: def sxor(s1,s2):    
   ...:     return ''.join(chr(ord(a) ^ ord(b)) for a,b in zip(s1,s2))
   ...:     

In [8]: s1 = "a"*128

In [9]: s2 = "b"*128

In [10]: %timeit sxor(s1, s2)
21.3 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s1 = "a"*128; b1=s1.encode()

In [12]: s2 = "b"*128; b2=s2.encode()

In [13]: %timeit byte_xor(b1, b2)
11.8 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: def byte_xor(ba1, ba2):
    ...:     return bytes(_a ^ _b for _a, _b in zip(ba1, ba2))
    ...:     

In [15]: %timeit byte_xor(b1, b2)
12.3 µs ± 77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: def bxor(b1, b2): # use xor for bytes
    ...:     parts = []
    ...:     for b1, b2 in zip(b1, b2):
    ...:         parts.append(bytes([b1 ^ b2]))
    ...:     return b''.join(parts)
    ...:     

In [17]: %timeit bxor(b1, b2)
27.5 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [18]: def bxor(b1, b2): # use xor for bytes
    ...:     result = bytearray()
    ...:     for b1, b2 in zip(b1, b2):
    ...:         result.append(b1 ^ b2)
    ...:     return result
    ...:     

In [19]: %timeit bxor(b1, b2)
11.7 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and each individual xor invocation is about 100ns
In [23]: xor1=lambda x,y: x^y

In [24]: b1_0, b2_0 = (b1[0], b2[0])

In [25]: %timeit xor1(b1_0, b2_0)
104 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

And it is just a bloody simple bit-wise operation which should be bloody fast :-/ There must be some sane way to bring its speed up and avoid "per byte" XOR'ing.

Member

Ha -- I think I found the winner!!! https://stackoverflow.com/a/29409299/1265472 takes advantage of int.from_bytes:

    import sys

    def encrypt2(var, key, byteorder=sys.byteorder):
        key, var = key[:len(var)], var[:len(key)]
        int_var = int.from_bytes(var, byteorder)
        int_key = int.from_bytes(key, byteorder)
        int_enc = int_var ^ int_key
        return int_enc.to_bytes(len(var), byteorder)
Seems to be all kosher and the same as the more explicit version (but ~10x faster on my sample case):
In [15]: s2 = "b"*129; b2=s2.encode()

In [16]: s1 = "a"*128; b1=s1.encode()

In [17]: byte_xor(b1, b2) == encrypt2(b1, b2)
Out[17]: True

In [18]: %timeit encrypt2(b1, b2)
968 ns ± 0.345 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [19]: %timeit byte_xor(b1, b2)
10.3 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@yarikoptic (Member) Feb 21, 2022

The only thing I am not sure about is the trimming of one value based on the length of the other... (and in the original zip implementations too, if I got it right):

In [23]: encrypt2(b"123", b"12")
Out[23]: b'\x00\x00'

In [24]: encrypt2(b"12", b"123")
Out[24]: b'\x00\x00'

We might need to get the max length first and pad, to avoid collisions.

Edit: I think it should be OK to do the padding with 0s just at the moment of XOR'ing two values, i.e. there should be no need to pad all values first. But it is worth testing explicitly that order doesn't matter.
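
For example, a quick hedged sketch of that pad-at-XOR-time idea (hypothetical helper, not code from this PR):

    from itertools import zip_longest

    def byte_xor_padded(ba1, ba2):
        # pad the shorter byte string with zero bytes only while XOR'ing
        return bytes(a ^ b for a, b in zip_longest(ba1, ba2, fillvalue=0))

    # unlike the trimming version, differing lengths no longer collapse to zeros
    assert byte_xor_padded(b"123", b"12") == b"\x00\x003"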

Member

Here is a little script with the trimming removed and a test randomizing the order; it seems to work OK:
import sys

def encrypt2(var, key, byteorder=sys.byteorder):
    # XOR two byte strings of possibly different lengths via integers;
    # no trimming, and to_bytes() sizes the result to its highest set bit
    int_var = int.from_bytes(var, byteorder)
    int_key = int.from_bytes(key, byteorder)
    int_enc = int_var ^ int_key
    return int_enc.to_bytes((int_enc.bit_length() + 7)//8, byteorder)

def xor_list(l):
    # XOR all (encoded) strings of the list together, left to right
    l = [s.encode() for s in l]
    if not l:
        return
    out = l[0]
    for v in l[1:]:
        out = encrypt2(out, v)
    return out

import random
random.seed(1)

l = ["abc", "a", "122", "", "al/s/1"]
#l = ["abc", "a", "122"] # , "", "al/s/1"]


r1 = xor_list(l)
print(r1)

# test that order does not matter
for i in range(100):
    random.shuffle(l)
    r = xor_list(l)
    if r1 != r:
        print(f"DIFF: {r}")

Member

Obviously xor_list does not have to be that ugly, since there is reduce: functools.reduce(encrypt2, [v.encode() for v in l])
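
For reference, the reduce-based form might look like this (assuming the encrypt2 helper from the script above; the empty-list case is kept explicit):

    from functools import reduce

    def xor_list(l):
        encoded = [v.encode() for v in l]
        return reduce(encrypt2, encoded) if encoded else None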

Member Author

Here are the caching timings for a Zarr checksum calculation using this branch with the new bytes-XORing vs. the previous version vs. PR #70, all run against a directory containing 37480 files:

fscacher version     Cache Miss (s)   Cache Hit (s)
PR #70               63.4592          0.331767
PR #71, previous     67.1527          0.379186
PR #71, current      64.0528          0.328146

Member

Thank you for the timings @jwodder! I am still really glad to see sub-second performance for "cache hit"s -- that is awesome!!!

As for #70 vs #71 (current): probably the impact of sorted is not as pronounced as I was afraid. Some crude timing shows that sorting a shuffled list of 370k elements is still just ~100 ms:
In [2]: l = [str(i) for i in range(37000)]

In [3]: %timeit sorted(l)
614 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: random.seed(1); random.shuffle(l)

In [5]: %timeit sorted(l)
7.29 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: random.seed(2); random.shuffle(l)

In [7]: %timeit sorted(l)
7.43 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: l = [str(i) for i in range(370000)]

In [9]: %timeit sorted(l)
12.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: random.seed(1); random.shuffle(l)

In [11]: %timeit sorted(l)
120 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But I still feel that it is better to avoid n*log(n) operations (plus the need to store long lists, causing memory consumption) whenever possible, so I would keep this one (#71, current).

@codecov-commenter commented Feb 18, 2022

Codecov Report

Merging #71 (a5f1e86) into master (775daf0) will increase coverage by 2.01%.
The diff coverage is 95.65%.


@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   92.61%   94.62%   +2.01%     
==========================================
  Files           3        4       +1     
  Lines         406      595     +189     
  Branches       50      106      +56     
==========================================
+ Hits          376      563     +187     
- Misses         18       19       +1     
- Partials       12       13       +1     
Impacted Files                      Coverage Δ
src/fscacher/cache.py               94.27% <90.00%> (+4.11%) ⬆️
src/fscacher/tests/test_cache.py    94.61% <100.00%> (+1.00%) ⬆️
src/fscacher/tests/test_util.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 775daf0...a5f1e86.

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic All of the benchmarking failures seem to be just for when caching is disabled; #72 should make that faster.

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic I used this script to time caching of a very large directory with this PR and #70. Both PRs brought the post-caching lookup time down from 11s to 0.3s, though there wasn't much speed difference between the two PRs.

@yarikoptic (Member)

> #72 should make that faster.

Thanks, merged that one. Maybe rebase this so we can get some clarity?

> Both PRs brought the post-caching lookup time down from 11s to 0.3s,

Cool! So it is those 11 sec as in the dandi/dandi-cli#913 (comment) table of results?

> though there wasn't much speed difference between the two PRs.

That is "curious", since this PR AFAIK avoids the n*log(n) sorting operation, so I would have expected it to be more performant... The simplicity of #70 appeals, but I still think that we should squeeze out more cycles somehow. I will py-spy it later to see where we spend the time -- if it is that XOR'ing via lists, we might want to really look into making it lighter (avoiding explicit list/looping).

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic Rebased.

> Cool! So it is those 11 sec as in the dandi/dandi-cli#913 (comment) table of results?

Yes.

@jwodder marked this pull request as ready for review on February 21, 2022 16:43
@yarikoptic (Member)

FTR, I do not see any relevant sorted left behind:

(git)lena:~/proj/fscacher[gh-68b]git
$> git grep sorted
src/fscacher/_version.py:        print("likely tags: %s" % ",".join(sorted(tags)))
src/fscacher/_version.py:    for ref in sorted(tags):
tox.ini:sort_relative_in_force_sorted_sections = True
versioneer.py:        print("likely tags: %%s" %% ",".join(sorted(tags)))
versioneer.py:    for ref in sorted(tags):
versioneer.py:        print("likely tags: %s" % ",".join(sorted(tags)))
versioneer.py:    for ref in sorted(tags):

$> git describe
0.1.6-19-g7c1ca60

@yarikoptic (Member)

Thank you @jwodder , let's proceed!

Labels: performance (Improve performance of an existing feature)

Successfully merging this pull request may close these issues:

- from joblib: UserWarning: Persisting input arguments took XXX to run.
- directories: avoid full sort

3 participants