-
https://github.com/google/hashtable-benchmarks has a first cut of benchmarks that you can use. If you get a working change, we can test it on websearch pretty easily and get back to you. I suspect the difficulty you will have is that some of the SSE probing stuff relies heavily on the fact that a high bit of zero means one class of things and a high bit of one means a different class of things. There are currently three sentinels; the relevant mask bits are for empty and for empty-or-deleted.
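For reference, the control-byte classes being referred to look roughly like this -- from memory of absl/container/internal/raw_hash_set.h, so treat the exact values as approximate:

```cpp
#include <cstdint>

// Rough sketch of the control-byte encoding (values as I recall them).
// The SIMD probing relies on the high bit alone separating full slots
// from everything else, and on simple bit patterns for the masks.
enum class ctrl_t : int8_t {
  kEmpty = -128,   // 0b10000000 -> matched by MaskEmpty()
  kDeleted = -2,   // 0b11111110 -> matched (with kEmpty) by MaskEmptyOrDeleted()
  kSentinel = -1,  // 0b11111111 -> end-of-table marker
  // Full slots store 0b0xxxxxxx: the seven-bit H2() of the element's hash.
};
```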
-
I'll see if I can find time. hashtable-benchmarks doesn't build out of the box, by the way, and has PRs waiting to help fix it.
-
I'm actually coming here from https://youtu.be/ncHmEUmJZf4 and have only read the code to confirm that it's the same stuff, so it's possible I missed something in the details, here...
I understand the `H2()` table is either a seven-bit hash or a handful of sentinel values -- like, two or three of them -- so there's a big gap of unassigned values between 128 and 253-ish. Is that still the case? Maybe it would make sense to define `H2()` as something like the sketch below, so that the hash covers a greater range of non-sentinel values and brings the false positive rate down a bit (well, almost a bit)?
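Concretely, I mean something along these lines -- just a sketch, and the 64-bit hash width plus the constant 253 ("256 minus the sentinels") are my assumptions rather than the current abseil code:

```cpp
#include <cstdint>

// Sketch: map the full 64-bit hash onto 0..252 by taking the high half
// of a widening multiply (Lemire-style range reduction), instead of
// keeping only the low seven bits. Uses the GCC/Clang __int128 extension.
inline uint8_t H2_sketch(uint64_t hash) {
  return static_cast<uint8_t>(
      (static_cast<unsigned __int128>(hash) * 253) >> 64);
}
```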
Optionally, the low part of the multiplication could be retained as a slight improvement on the avalanching of the original hash: it comes out for free when calculating the high part, and it equals x*253 % 2**64, which is a permutation of the original value because 253 is odd and therefore invertible mod 2**64 (if I have the available range wrong, just pick any odd multiplier).
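In other words, the same widening multiply can produce both halves at once (again a sketch; how the two results would be wired into `H1()`/`H2()` is an assumption on my part):

```cpp
#include <cstdint>
#include <utility>

// Sketch: one multiply yields the reduced H2 value (high 64 bits) and a
// permuted copy of the hash (low 64 bits). The low half is
// hash * 253 mod 2**64, a bijection because 253 is odd, so it could feed
// H1() with slightly better mixing at no extra cost.
inline std::pair<uint64_t, uint8_t> SplitHash_sketch(uint64_t hash) {
  unsigned __int128 product = static_cast<unsigned __int128>(hash) * 253;
  uint8_t h2 = static_cast<uint8_t>(product >> 64);     // in 0..252
  uint64_t permuted = static_cast<uint64_t>(product);   // hash * 253 mod 2**64
  return {permuted, h2};
}
```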
I'm not sure how slow the multiplier is these days, or whether the false positive rate comes down far enough to offset that cost. There are a few alternative methods, though. If distribution bias isn't a concern (i.e., if some hashes having distinctly worse false-positive rates than others would be OK), then reducing 0..255 to 0..252 can be as simple as `x - (x >> 6)`.

I think you get one eliminated bit-shift operation in return, because if my number theory is correct (don't bet on this!), `H1()` no longer needs to right-shift the original hash to make it independent[*] of `H2()`.
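To make that alternative concrete (a sketch; taking the low byte of the hash here is an assumption about how it would be wired in):

```cpp
#include <cstdint>

// Sketch: squeeze a byte in 0..255 down to 0..252 without a multiply.
// The mapping is slightly biased (some outputs have two preimages),
// which is the distribution trade-off mentioned above.
inline uint8_t H2_shift_sketch(uint64_t hash) {
  uint8_t x = static_cast<uint8_t>(hash);  // low byte of the hash
  return x - (x >> 6);                     // max value: 255 - 3 = 252
}
```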
Anyway, I'm writing all this down because I'm afraid I won't get around to testing it for myself, and even if I do, I won't believe my own test results are representative because I don't know what I should use for a benchmark. If nothing else, what benchmark should I be using?
[*] mostly independent -- leakage is probably smaller than flaws in the input hash.