
Cache directory fingerprint as a XORed hash of file fingerprints #71

Merged: 3 commits merged into master on Feb 22, 2022

Conversation

@jwodder (Member) commented Feb 18, 2022

Closes #68.
Closes #19.

Alternative to #70.
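
For context, a minimal hypothetical sketch of the idea in the PR title (illustrative only, not the actual fscacher implementation): hash each file's fingerprint separately and XOR the digests together, which gives a directory fingerprint that does not depend on the order in which files are visited.

    from hashlib import md5

    def directory_fingerprint(file_fingerprints):
        # file_fingerprints: an iterable of per-file fingerprint byte strings
        # (hypothetical input; fscacher's actual fingerprint format may differ)
        result = bytes(16)  # 16 zero bytes, the size of an MD5 digest
        for fprint in file_fingerprints:
            digest = md5(fprint).digest()
            # XOR is commutative and associative, so file order does not matter
            result = bytes(a ^ b for a, b in zip(result, digest))
        return result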

@jwodder added the "performance" label (Improve performance of an existing feature) on Feb 18, 2022
@yarikoptic (Member)

Also would close #19 ?

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic Yes.


    def add_file(self, path, fprint: FileFingerprint):
        self.tree_fprints[path] = fprint
        if self.last_modified is None or self.last_modified < fprint.mtime_ns:
            ...
        fprint_hash = list(
Member

Why is it a list?

@jwodder (Member, Author) Feb 18, 2022

md5().digest() returns bytes, and we need to convert it to a list to get a sequence of ints, which can be XORed.
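
For illustration, a small hedged example of the conversion described above (hypothetical inputs, not the exact code from the diff):

    from hashlib import md5

    # md5(...).digest() is a 16-byte bytes object; list() gives a list of ints
    d1 = list(md5(b"fingerprint-one").digest())
    d2 = list(md5(b"fingerprint-two").digest())
    # element-wise XOR of the two digests
    xored = [a ^ b for a, b in zip(d1, d2)]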

Member

Indeed. Somewhat odd that there is no builtin XOR over bytes.

I have benchmarked some "implementations" I found around and saw no magical winner to help us out (I was expecting them to perform faster, BTW -- timings are in µs, not the ns I was hoping for :-/):
In [6]: def byte_xor(ba1, ba2):
   ...:     return bytes([_a ^ _b for _a, _b in zip(ba1, ba2)])
   ...:     

In [7]: def sxor(s1,s2):    
   ...:     return ''.join(chr(ord(a) ^ ord(b)) for a,b in zip(s1,s2))
   ...:     

In [8]: s1 = "a"*128

In [9]: s2 = "b"*128

In [10]: %timeit sxor(s1, s2)
21.3 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s1 = "a"*128; b1=s1.encode()

In [12]: s2 = "b"*128; b2=s2.encode()

In [13]: %timeit byte_xor(b1, b2)
11.8 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: def byte_xor(ba1, ba2):
    ...:     return bytes(_a ^ _b for _a, _b in zip(ba1, ba2))
    ...:     

In [15]: %timeit byte_xor(b1, b2)
12.3 µs ± 77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: def bxor(b1, b2): # use xor for bytes
    ...:     parts = []
    ...:     for b1, b2 in zip(b1, b2):
    ...:         parts.append(bytes([b1 ^ b2]))
    ...:     return b''.join(parts)
    ...:     

In [17]: %timeit bxor(b1, b2)
27.5 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [18]: def bxor(b1, b2): # use xor for bytes
    ...:     result = bytearray()
    ...:     for b1, b2 in zip(b1, b2):
    ...:         result.append(b1 ^ b2)
    ...:     return result
    ...:     

In [19]: %timeit bxor(b1, b2)
11.7 µs ± 84.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and each individual xor invocation is about 100ns
In [23]: xor1=lambda x,y: x^y

In [24]: b1_0, b2_0 = (b1[0], b2[0])

In [25]: %timeit xor1(b1_0, b2_0)
104 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

And it is just a bloody simple bit-wise operation which should be bloody fast :-/ There must be some sane way to bring its speed up and avoid "per byte" XOR'ing.

Member

Ha -- I think I found the winner!!! https://stackoverflow.com/a/29409299/1265472 takes advantage of int.from_bytes:

    import sys

    def encrypt2(var, key, byteorder=sys.byteorder):
        key, var = key[:len(var)], var[:len(key)]
        int_var = int.from_bytes(var, byteorder)
        int_key = int.from_bytes(key, byteorder)
        int_enc = int_var ^ int_key
        return int_enc.to_bytes(len(var), byteorder)
Seems to be all kosher and the same as the more explicit version (but ~10x faster on my sample case):
In [15]: s2 = "b"*129; b2=s2.encode()

In [16]: s1 = "a"*128; b1=s1.encode()

In [17]: byte_xor(b1, b2) == encrypt2(b1, b2)
Out[17]: True

In [18]: %timeit encrypt2(b1, b2)
968 ns ± 0.345 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [19]: %timeit byte_xor(b1, b2)
10.3 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@yarikoptic (Member) Feb 21, 2022

The only thing I am not sure about is the trimming of one value based on the length of the other... (and in the original zip implementations too, if I got it right):

In [23]: encrypt2(b"123", b"12")
Out[23]: b'\x00\x00'

In [24]: encrypt2(b"12", b"123")
Out[24]: b'\x00\x00'

We might need to get the max length first and pad, to avoid collisions.

Edit: I think it should be OK to do the padding with 0s just at the moment of XOR'ing two values, i.e. there should be no need to pad all values first. But it is worth testing explicitly that order doesn't matter.
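
For example, a quick hedged sketch of that pad-at-XOR-time idea (hypothetical helper, not code from this PR):

    from itertools import zip_longest

    def byte_xor_padded(ba1, ba2):
        # pad the shorter byte string with zero bytes only while XOR'ing
        return bytes(a ^ b for a, b in zip_longest(ba1, ba2, fillvalue=0))

    # unlike the trimming version, differing lengths no longer collapse to zeros
    assert byte_xor_padded(b"123", b"12") == b"\x00\x003"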

Member

Here is a little script with the trimming removed and a test randomizing the order; it seems to work OK:
import sys

def encrypt2(var, key, byteorder=sys.byteorder):
    # XOR two byte strings of possibly different lengths via integers;
    # no trimming, and to_bytes() sizes the result to its highest set bit
    int_var = int.from_bytes(var, byteorder)
    int_key = int.from_bytes(key, byteorder)
    int_enc = int_var ^ int_key
    return int_enc.to_bytes((int_enc.bit_length() + 7)//8, byteorder)

def xor_list(l):
    # XOR all (encoded) strings of the list together, left to right
    l = [s.encode() for s in l]
    if not l:
        return
    out = l[0]
    for v in l[1:]:
        out = encrypt2(out, v)
    return out

import random
random.seed(1)

l = ["abc", "a", "122", "", "al/s/1"]
#l = ["abc", "a", "122"] # , "", "al/s/1"]


r1 = xor_list(l)
print(r1)

# test that order does not matter
for i in range(100):
    random.shuffle(l)
    r = xor_list(l)
    if r1 != r:
        print(f"DIFF: {r}")

Member

Obviously xor_list does not have to be that ugly, since there is reduce: functools.reduce(encrypt2, [v.encode() for v in l])
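
For reference, the reduce-based form might look like this (assuming the encrypt2 helper from the script above; the empty-list case is kept explicit):

    from functools import reduce

    def xor_list(l):
        encoded = [v.encode() for v in l]
        return reduce(encrypt2, encoded) if encoded else None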

Member Author

Here are the caching timings for a Zarr checksum calculation using this branch with the new bytes-XORing vs. the previous version vs. PR #70, all run against a directory containing 37480 files:

fscacher version     Cache Miss (s)   Cache Hit (s)
PR #70               63.4592          0.331767
PR #71, previous     67.1527          0.379186
PR #71, current      64.0528          0.328146

Member

Thank you for the timings @jwodder! I am still really glad to see sub-second performance for "cache hit"s -- that is awesome!!!

As for #70 vs #71 (current): probably the impact of sorted is not as pronounced as I was afraid. Some crude timing shows that sorting a shuffled list of 370k elements is still just ~100 ms:
In [2]: l = [str(i) for i in range(37000)]

In [3]: %timeit sorted(l)
614 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: random.seed(1); random.shuffle(l)

In [5]: %timeit sorted(l)
7.29 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: random.seed(2); random.shuffle(l)

In [7]: %timeit sorted(l)
7.43 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: l = [str(i) for i in range(370000)]

In [9]: %timeit sorted(l)
12.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: random.seed(1); random.shuffle(l)

In [11]: %timeit sorted(l)
120 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But I still feel that it is better to avoid n*log(n) operations (plus the need to store long lists, causing memory consumption) whenever possible, so I would keep this one (#71, current).

@codecov-commenter commented Feb 18, 2022

Codecov Report

Merging #71 (a5f1e86) into master (775daf0) will increase coverage by 2.01%.
The diff coverage is 95.65%.


@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   92.61%   94.62%   +2.01%     
==========================================
  Files           3        4       +1     
  Lines         406      595     +189     
  Branches       50      106      +56     
==========================================
+ Hits          376      563     +187     
- Misses         18       19       +1     
- Partials       12       13       +1     
Impacted Files                      Coverage Δ
src/fscacher/cache.py               94.27% <90.00%> (+4.11%) ⬆️
src/fscacher/tests/test_cache.py    94.61% <100.00%> (+1.00%) ⬆️
src/fscacher/tests/test_util.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 775daf0...a5f1e86.

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic All of the benchmarking failures seem to be just for when caching is disabled; #72 should make that faster.

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic I used this script to time caching of a very large directory with this PR and #70. Both PRs brought the post-caching lookup time down from 11s to 0.3s, though there wasn't much speed difference between the two PRs.

@yarikoptic (Member)

> #72 should make that faster.

Thanks, merged that one. Maybe rebase this so we can get some clarity?

> Both PRs brought the post-caching lookup time down from 11s to 0.3s,

Cool! So it is those 11 sec as in the dandi/dandi-cli#913 (comment) table of results?

> though there wasn't much speed difference between the two PRs.

That is "curious", since this PR AFAIK avoids the n*log(n) sorting operation, so I would have expected it to be more performant... The simplicity of #70 appeals, but I still think that we should squeeze out more cycles somehow. I will py-spy it later to see where we spend the time -- if it is that XOR'ing via lists, we might want to really look into making it lighter (avoiding explicit list/looping).

@jwodder (Member, Author) commented Feb 18, 2022

@yarikoptic Rebased.

> Cool! So it is those 11 sec as in the dandi/dandi-cli#913 (comment) table of results?

Yes.

@jwodder marked this pull request as ready for review on February 21, 2022 16:43
@yarikoptic (Member)

FTR, I do not see any relevant sorted left behind:

(git)lena:~/proj/fscacher[gh-68b]git
$> git grep sorted
src/fscacher/_version.py:        print("likely tags: %s" % ",".join(sorted(tags)))
src/fscacher/_version.py:    for ref in sorted(tags):
tox.ini:sort_relative_in_force_sorted_sections = True
versioneer.py:        print("likely tags: %%s" %% ",".join(sorted(tags)))
versioneer.py:    for ref in sorted(tags):
versioneer.py:        print("likely tags: %s" % ",".join(sorted(tags)))
versioneer.py:    for ref in sorted(tags):

$> git describe
0.1.6-19-g7c1ca60

@yarikoptic (Member)

Thank you @jwodder , let's proceed!

Labels: performance (Improve performance of an existing feature)

Successfully merging this pull request may close these issues:

- from joblib: UserWarning: Persisting input arguments took XXX to run.
- directories: avoid full sort

3 participants