
Too large minhashLSH index #207

@bryanyzhu

Description


Hi, I have a question about building a large-scale LSH index. With billions of documents, I suppose even 1 TB of RAM is not enough for an in-memory LSH index. Is there a recommended way to use datasketch in this scenario? Thank you.
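One direction the datasketch library itself offers for indexes that outgrow RAM is a pluggable storage backend, which keeps the LSH hash tables in an external store instead of in-process Python dicts. A minimal sketch, assuming datasketch is installed and a Redis server is reachable at `localhost:6379` (both are assumptions; the host/port and document tokens here are placeholders):

```python
# Sketch: MinHashLSH with its hash tables stored in Redis rather than
# on the Python heap, so the index size is bounded by Redis, not by
# this process's RAM. Assumes a running Redis server at localhost:6379.
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)

# Build a minhash for one example document and insert it.
m = MinHash(num_perm=128)
for token in ["some", "example", "tokens"]:
    m.update(token.encode("utf8"))
lsh.insert("doc-0", m)
```

This moves the memory pressure out of the Python process, though query latency then depends on the Redis deployment.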

I also opened issue #206 because even for a small subset on my local machine (a 6 GB pickle file containing pre-computed minhashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM. If I use LeanMinHash, it takes 26 GB. By simple extrapolation, indexing 600 GB of pre-computed minhashes would take about 3 TB of RAM, which is just too much. Maybe mapping the index to disk could be a viable solution. Looking forward to any suggestions, thank you.
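The extrapolation above is just the observed index-to-input RAM ratio applied to the full dataset, which can be checked with quick arithmetic:

```python
# Back-of-envelope check of the memory extrapolation above.
pickle_gb = 6        # size of the pre-computed minhash pickle on disk
index_ram_gb = 31    # RAM used by MinHashLSH at threshold 0.5
full_data_gb = 600   # full set of pre-computed minhashes

ratio = index_ram_gb / pickle_gb            # ~5.2x memory blow-up
projected_tb = ratio * full_data_gb / 1000  # projected RAM in TB
print(round(projected_tb, 1))               # roughly 3 TB, as stated
```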
