-
https://github.com/google/hashtable-benchmarks has a first cut of benchmarks that you can use. If you get a working change, we can test it on websearch pretty easily and get back to you. I suspect the difficulty you will have is that some of the SSE probing stuff relies heavily on the fact that a high bit of zero means one class of things and a high bit of one means a different class of things. There are currently three sentinels; the relevant mask bits are for empty and for empty-or-deleted.
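For reference, the control-byte classes being referred to look roughly like this -- from memory of absl/container/internal/raw_hash_set.h, so treat the exact values as approximate:

```cpp
#include <cstdint>

// Rough sketch of the control-byte encoding (values as I recall them).
// The SIMD probing relies on the high bit alone separating full slots
// from everything else, and on simple bit patterns for the masks.
enum class ctrl_t : int8_t {
  kEmpty = -128,   // 0b10000000 -> matched by MaskEmpty()
  kDeleted = -2,   // 0b11111110 -> matched (with kEmpty) by MaskEmptyOrDeleted()
  kSentinel = -1,  // 0b11111111 -> end-of-table marker
  // Full slots store 0b0xxxxxxx: the seven-bit H2() of the element's hash.
};
```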
-
I'll see if I can find time. hashtable-benchmarks doesn't build out of the box, by the way, and has PRs waiting to help fix it.
-
I'm actually coming here from https://youtu.be/ncHmEUmJZf4 and have only read the code to confirm that it's the same stuff, so it's possible I missed something in the details, here...
I understand the `H2()` table is either a seven-bit hash or a handful of sentinel values -- like, two or three of them -- so there's a big gap of unassigned values between 128 and 253-ish. Is that still the case? Maybe it would make sense to define `H2()` as something like the sketch below, so that the hash covers a greater range of non-sentinel values and brings the false positive rate down a bit (well, almost a bit)?
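Concretely, I mean something along these lines -- just a sketch, and the 64-bit hash width plus the constant 253 ("256 minus the sentinels") are my assumptions rather than the current abseil code:

```cpp
#include <cstdint>

// Sketch: map the full 64-bit hash onto 0..252 by taking the high half
// of a widening multiply (Lemire-style range reduction), instead of
// keeping only the low seven bits. Uses the GCC/Clang __int128 extension.
inline uint8_t H2_sketch(uint64_t hash) {
  return static_cast<uint8_t>(
      (static_cast<unsigned __int128>(hash) * 253) >> 64);
}
```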
Optionally, the low part of the multiplication could be retained as a slight improvement on the avalanching of the original hash: it comes out for free when calculating the high part, and it equals x*253 % 2**64, which is a permutation of the original value because 253 is odd and therefore invertible mod 2**64 (if I have the available range wrong, just pick any odd multiplier).
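In other words, the same widening multiply can produce both halves at once (again a sketch; how the two results would be wired into `H1()`/`H2()` is an assumption on my part):

```cpp
#include <cstdint>
#include <utility>

// Sketch: one multiply yields the reduced H2 value (high 64 bits) and a
// permuted copy of the hash (low 64 bits). The low half is
// hash * 253 mod 2**64, a bijection because 253 is odd, so it could feed
// H1() with slightly better mixing at no extra cost.
inline std::pair<uint64_t, uint8_t> SplitHash_sketch(uint64_t hash) {
  unsigned __int128 product = static_cast<unsigned __int128>(hash) * 253;
  uint8_t h2 = static_cast<uint8_t>(product >> 64);     // in 0..252
  uint64_t permuted = static_cast<uint64_t>(product);   // hash * 253 mod 2**64
  return {permuted, h2};
}
```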
I'm not sure how slow the multiplier is these days, or whether the false positive rate comes down far enough to offset that cost. There are a few alternative methods, though. If distribution bias isn't a concern (i.e., if some hashes having distinctly worse false-positive rates than others would be OK), then reducing 0..255 to 0..252 can be as simple as `x - (x >> 6)`.

I think you get one eliminated bit-shift operation in return, because if my number theory is correct (don't bet on this!), `H1()` no longer needs to right-shift the original hash to make it independent[*] of `H2()`.
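To make that alternative concrete (a sketch; taking the low byte of the hash here is an assumption about how it would be wired in):

```cpp
#include <cstdint>

// Sketch: squeeze a byte in 0..255 down to 0..252 without a multiply.
// The mapping is slightly biased (some outputs have two preimages),
// which is the distribution trade-off mentioned above.
inline uint8_t H2_shift_sketch(uint64_t hash) {
  uint8_t x = static_cast<uint8_t>(hash);  // low byte of the hash
  return x - (x >> 6);                     // max value: 255 - 3 = 252
}
```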
Anyway, I'm writing all this down because I'm afraid I won't get around to testing it for myself, and even if I do, I won't believe my own test results are representative because I don't know what I should use for a benchmark. If nothing else, what benchmark should I be using?
[*] mostly independent -- leakage is probably smaller than flaws in the input hash.