-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Search] XOR Filters #2925
Comments
Hi @JensUweUlrich, Thanks for the heads up. AFAIK There is currently no plan to implement it. |
Hi @smehringer The paper describing XOR filters can be found here. Best |
There is also a nice, short explanation on Stack Overflow: https://stackoverflow.com/a/67527508 My thoughts:
As always, it really depends on one's needs 😁 |
Thanks for your answer @eseiler I have some applications in mind where all elements are known at the time of construction, e.g. trying to answer if a sequenced read is part of a given set of reference sequences. In such scenarios the XOR filter could be useful. |
Hi @JensUweUlrich, I think for the use case of a fixed reference set XOR filters seem like a good alternative. For example it would be nice if the following code still works if we just plug in #include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>
int main()
{
seqan3::interleaved_bloom_filter ibf{seqan3::bin_count{12u}, seqan3::bin_size{8192u}};
ibf.emplace(126, seqan3::bin_index{0u});
ibf.emplace(712, seqan3::bin_index{3u});
ibf.emplace(237, seqan3::bin_index{9u});
auto agent = ibf.membership_agent();
auto & result = agent.bulk_contains(712);
seqan3::debug_stream << result << '\n'; // prints [0,0,0,1,0,0,0,0,0,0,0,0]
} If you have any questions or facing problems with this design I'd be happy to help. |
Hi @smehringer that was my plan as well. Would be a mess to implement and not share it with the whole seqan community. Cheers |
Hi @smehringer & @eseiler I finished a first naive implementation of the interleaved XOR filter (IXF) and pushed it to my forked seqan3 repo (https://github.com/JensUweUlrich/seqan3). You can have a glance at the code and test it if you like to. Some remarks:
At the end of the day, the IXF can be useful for bigger datasets and applications where low FPR is required, but query time is not crucial. Hope this is useful for some of you. Cheers PS: I also have an experimental implementation of an interleaved Binary Fuse Filter. But it fails construction for bin numbers higher than 300, which is due to the intensive use of hashing and resetting of seeds for different bins. Maybe I will find some time to improve it in the near future. |
Nice work! I have some questions:
The reduction in size is indeed interesting. It's about half of that of an IBF. Could you interpolate if this reduction would be expected to be independent of the number of bins and the data (it's kmer distribution)? We currently reduce the size of the IBF with minimizers which works quite well. So I don't know if we are in need of XOR filters and if they can compete with tools like mantis if the performance is low. |
The FPR depends on the used fingerprint size. That said, the higher the fingerprint size, the lower the FPR. But it is independent on the number of bins
Same issues here. More bins lead to decreased performance. Would be nice to see the HIBF implementation, so I could think about a hierarchical IXF, too.
Possible, would lead to a minimum FPR but you would lose parallel querying of bins.
The reduction is independent of the number of bins or data.
Downsampling or subsampling of entries by using minimizer or syncmers is indeed a good approach but there's definitely a limit for that ( 20% of k-mers if I remember correctly). And as I said, for really large sets of references, the IXF could be an alternative. BTW: Have you already benchmarked the IBF to mantis? |
Regarding runtime, I had a glance at the code, and in a few (> 1) places you do something like
Maybe it was
Otherwise, you'll copy the vectors, but you only read (not write) them; so no copy needed. But now I'm kinda hooked, maybe I'll get around to making the code seqan3 style (you have some code from the original implementation, and they are kinda c-style c++). |
To be honest, I did not invest too much time in code optimization. I'm sure you will find some places in the code where you can tweak performance. But theoretically, the IXF can never reach faster queries than the IBF with more than 8 bins. But if you find it useful or have some ideas for improvement, we can discuss further. I'm happy to help you make seqan even better ;-) |
Question
Hi Seqan-Team,
I just recently read about XOR Filters and that they were superior to Bloom Filters in regards to memory usage and query time. Do you plan (or maybe already started) to implement the data structure for upcoming releases?
Thanks
Jens
The text was updated successfully, but these errors were encountered: