MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

ekzhu/datasketch

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amounts of data very quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch        Usage
MinHash            estimate Jaccard similarity and cardinality
Weighted MinHash   estimate weighted Jaccard similarity
HyperLogLog        estimate cardinality
HyperLogLog++      estimate cardinality

The following indexes for data sketches are provided to support sub-linear query time:

Index                  For Data Sketch             Supported Query Type
MinHash LSH            MinHash, Weighted MinHash   Jaccard Threshold
MinHash LSH Forest     MinHash, Weighted MinHash   Jaccard Top-K
MinHash LSH Ensemble   MinHash                     Containment Threshold
HNSW                   Any                         Custom Metric Top-K

datasketch requires Python 3.7 or above, NumPy 1.11 or above, and SciPy.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra as storage layers (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

pip install datasketch[redis]

To install with the Cassandra dependency:

pip install datasketch[cassandra]