Cache directory fingerprint as a XORed hash of file fingerprints #71
Conversation
Also would close #19 ?

@yarikoptic Yes.
src/fscacher/cache.py (outdated)

```python
def add_file(self, path, fprint: FileFingerprint):
    self.tree_fprints[path] = fprint
    if self.last_modified is None or self.last_modified < fprint.mtime_ns:
        ...
    fprint_hash = list(
```
Why is it a list?
`md5().digest()` returns `bytes`, and we need to convert it to a list to get a sequence of ints, which can be XORed.
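A minimal sketch of the idea (with hypothetical fingerprint payloads, not the actual fscacher code): treat each digest as a sequence of ints and XOR them element-wise.

```python
from hashlib import md5

# Hypothetical per-file fingerprint payloads, just for illustration.
d1 = list(md5(b"fingerprint-of-file-1").digest())
d2 = list(md5(b"fingerprint-of-file-2").digest())

# XOR the two 16-byte digests element-wise; each element is an int in 0..255.
xored = [a ^ b for a, b in zip(d1, d2)]
```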
indeed. somewhat odd that there is no builtin XOR over bytes.
I have benchmarked some "implementations" I found around and saw no magical winner to help us out (I was expecting it to perform faster BTW -- it is in µs, not the ns I was hoping for :-/)
```python
In [6]: def byte_xor(ba1, ba2):
   ...:     return bytes([_a ^ _b for _a, _b in zip(ba1, ba2)])
   ...:

In [7]: def sxor(s1, s2):
   ...:     return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(s1, s2))
   ...:

In [8]: s1 = "a"*128

In [9]: s2 = "b"*128

In [10]: %timeit sxor(s1, s2)
21.3 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s1 = "a"*128; b1 = s1.encode()

In [12]: s2 = "b"*128; b2 = s2.encode()

In [13]: %timeit byte_xor(b1, b2)
11.8 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: def byte_xor(ba1, ba2):
   ...:     return bytes(_a ^ _b for _a, _b in zip(ba1, ba2))
   ...:

In [15]: %timeit byte_xor(b1, b2)
12.3 µs ± 77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: def bxor(b1, b2):  # use xor for bytes
   ...:     parts = []
   ...:     for b1, b2 in zip(b1, b2):
   ...:         parts.append(bytes([b1 ^ b2]))
   ...:     return b''.join(parts)
   ...:

In [17]: %timeit bxor(b1, b2)
27.5 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [18]: def bxor(b1, b2):  # use xor for bytes
   ...:     result = bytearray()
   ...:     for b1, b2 in zip(b1, b2):
   ...:         result.append(b1 ^ b2)
   ...:     return result
   ...:

In [19]: %timeit bxor(b1, b2)
11.7 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
and each individual XOR invocation is about 100ns:
```python
In [23]: xor1 = lambda x, y: x ^ y

In [24]: b1_0, b2_0 = (b1[0], b2[0])

In [25]: %timeit xor1(b1_0, b2_0)
104 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```
and it is just a bloody simple bit-wise operation which should be bloody fast :-/ there must be some sane way to bring its speed up and avoid "per byte" XOR'ing.
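For the record, one commonly cited fast option that was not benchmarked in this thread (and which assumes a numpy dependency would even be acceptable here) is to XOR the buffers as uint8 arrays, which pushes the loop down into C:

```python
import numpy as np

def np_xor(b1: bytes, b2: bytes) -> bytes:
    # XOR two equal-length byte strings via vectorized uint8 arrays.
    # Unlike the zip() variants above, unequal lengths raise instead of trimming.
    a1 = np.frombuffer(b1, dtype=np.uint8)
    a2 = np.frombuffer(b2, dtype=np.uint8)
    return np.bitwise_xor(a1, a2).tobytes()
```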
ha -- I think I found the winner!!! https://stackoverflow.com/a/29409299/1265472 takes advantage of `int.from_bytes`:

```python
def encrypt2(var, key, byteorder=sys.byteorder):
    key, var = key[:len(var)], var[:len(key)]
    int_var = int.from_bytes(var, byteorder)
    int_key = int.from_bytes(key, byteorder)
    int_enc = int_var ^ int_key
    return int_enc.to_bytes(len(var), byteorder)
```
seems to be all kosher and the same as the more explicit version (but ~10x faster on my sample case, presumably because the whole XOR happens in a single C-level big-integer operation instead of a Python-level per-byte loop):
```python
In [15]: s2 = "b"*129; b2 = s2.encode()

In [16]: s1 = "a"*128; b1 = s1.encode()

In [17]: byte_xor(b1, b2) == encrypt2(b1, b2)
Out[17]: True

In [18]: %timeit encrypt2(b1, b2)
968 ns ± 0.345 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [19]: %timeit byte_xor(b1, b2)
10.3 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
the only thing I am not sure about is the trimming of one value based on the length of the other... (which the original zip-based implementations do too, if I got it right)

```python
In [23]: encrypt2(b"123", b"12")
Out[23]: b'\x00\x00'

In [24]: encrypt2(b"12", b"123")
Out[24]: b'\x00\x00'
```

we might need to get the max length first and pad to avoid collisions
edit: I think it should be ok to pad with 0s just at the moment of XOR'ing two values, i.e. there should be no need to pad all values first. but worth testing explicitly that order doesn't matter
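A quick way to see why deferred padding should be safe (my reading, not stated in the thread): `int.from_bytes` already treats a shorter byte string as if it were zero-padded, and XOR on ints is commutative and associative, so accumulation order cannot change the result. A minimal check, assuming a little-endian byte order so the padding goes at the tail:

```python
short = b"12"
padded = short + b"\x00\x00"  # explicit trailing zero-padding (little-endian layout)

# Both map to the same integer, so padding a short value changes nothing.
assert int.from_bytes(short, "little") == int.from_bytes(padded, "little")
```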
here is a little script with the trimming removed and a test that randomizes the order, and it seems to work OK:
```python
import sys

def encrypt2(var, key, byteorder=sys.byteorder):
    int_var = int.from_bytes(var, byteorder)
    int_key = int.from_bytes(key, byteorder)
    int_enc = int_var ^ int_key
    return int_enc.to_bytes((int_enc.bit_length() + 7) // 8, byteorder)

def xor_list(l):
    l = [s.encode() for s in l]
    if not l:
        return
    out = l[0]
    for v in l[1:]:
        out = encrypt2(out, v)
    return out

import random
random.seed(1)
l = ["abc", "a", "122", "", "al/s/1"]
#l = ["abc", "a", "122"]  # , "", "al/s/1"]
r1 = xor_list(l)
print(r1)

# test that order does not matter
for i in range(100):
    random.shuffle(l)
    r = xor_list(l)
    if r1 != r:
        print(f"DIFF: {r}")
```
obviously `xor_list` does not have to be that ugly, since there is `reduce`: `functools.reduce(encrypt2, [v.encode() for v in l])`
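Spelled out, a sketch of that `reduce` form, reusing the no-trim `encrypt2` from the script above and passing `b""` as the initial value so an empty input returns `b""` instead of raising `TypeError` (an assumption about the desired empty-list behavior; the loop version above returns `None` there):

```python
from functools import reduce

def xor_list(l):
    # Fold encrypt2 over all encoded values; b"" acts as the XOR identity,
    # since int.from_bytes(b"", byteorder) == 0.
    return reduce(encrypt2, (v.encode() for v in l), b"")
```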
Here are the caching timings for a Zarr checksum calculation using this branch with the new bytes-XORing vs. the previous version vs. PR #70, all run against a directory containing 37480 files:
| fscacher version | Cache Miss (s) | Cache Hit (s) |
| --- | --- | --- |
| PR #70 | 63.4592 | 0.331767 |
| PR #71, previous | 67.1527 | 0.379186 |
| PR #71, current | 64.0528 | 0.328146 |
Thank you for the timings @jwodder ! I am still really glad to see sub-second performance for cache hits -- that is awesome!!!
as for #70 vs #71 (current): probably the impact of `sorted` is not as pronounced as I was afraid. Some crude timing shows that it is still just ~100ms for a non-sorted list of 370k elements:
```python
In [1]: import random

In [2]: l = [str(i) for i in range(37000)]

In [3]: %timeit sorted(l)
614 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: random.seed(1); random.shuffle(l)

In [5]: %timeit sorted(l)
7.29 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: random.seed(2); random.shuffle(l)

In [7]: %timeit sorted(l)
7.43 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: l = [str(i) for i in range(370000)]

In [9]: %timeit sorted(l)
12.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: random.seed(1); random.shuffle(l)

In [11]: %timeit sorted(l)
120 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
but I still feel that it is better to avoid O(n log n) operations (plus the need to store long lists, which costs memory) whenever possible, so I would keep this #71, current.
Codecov Report
```
@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   92.61%   94.62%    +2.01%
==========================================
  Files           3        4        +1
  Lines         406      595      +189
  Branches       50      106       +56
==========================================
+ Hits          376      563      +187
- Misses         18       19        +1
- Partials       12       13        +1
```

Continue to review the full report at Codecov.
@yarikoptic All of the benchmarking failures seem to be just for when caching is disabled; #72 should make that faster.

@yarikoptic I used this script to time caching of a very large directory with this PR and #70. Both PRs brought the post-caching lookup time down from 11s to 0.3s, though there wasn't much speed difference between the two PRs.
thanks, merged that one. Maybe rebase this so we can get some clarity?

cool! so it is those 11 sec as in the dandi/dandi-cli#913 (comment) table of results? that is "curious" since this PR avoids `sorted` AFAIK
@yarikoptic Rebased.

Yes.
FTR, I do not see any relevant use of `sorted` remaining:

```
(git)lena:~/proj/fscacher[gh-68b]git
$> git grep sorted
src/fscacher/_version.py: print("likely tags: %s" % ",".join(sorted(tags)))
src/fscacher/_version.py: for ref in sorted(tags):
tox.ini:sort_relative_in_force_sorted_sections = True
versioneer.py: print("likely tags: %%s" %% ",".join(sorted(tags)))
versioneer.py: for ref in sorted(tags):
versioneer.py: print("likely tags: %s" % ",".join(sorted(tags)))
versioneer.py: for ref in sorted(tags):

$> git describe
0.1.6-19-g7c1ca60
```
Thank you @jwodder, let's proceed!
Closes #68.
Closes #19.
Alternative to #70.