This repository contains the source code for "Scalable Product Duplicate Detection using LSH". The method filters out dissimilar products using minhashing and LSH, after which only the candidate pairs are considered in Agglomerative Clustering with single linkage.
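The filtering step described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names `minhash_signature` and `lsh_candidate_pairs`, the MD5-based hash family, and the toy product token sets are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes):
    """Return a min-hash signature: one minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        # Seeding the input string simulates a family of independent hash functions.
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature


def lsh_candidate_pairs(signatures, bands):
    """Split each signature into bands; items sharing any band bucket become candidates."""
    # Rows per band; trailing rows are ignored if the length is not divisible by `bands`.
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Toy data (hypothetical): two identical products and one unrelated product.
products = {
    "a": {"samsung", "40inch", "1080p", "led"},
    "b": {"samsung", "40inch", "1080p", "led"},
    "c": {"sony", "55inch", "4k", "oled"},
}
signatures = {k: minhash_signature(v, num_hashes=20) for k, v in products.items()}
candidates = lsh_candidate_pairs(signatures, bands=10)
print(candidates)
```

Only the pairs in `candidates` would then be passed on to the clustering step, which is what keeps the number of pairwise comparisons small.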
First, download the dataset and extract the zip file so that the data is located at `data/TVs-all-merged.json`. Then, install the packages listed in `requirements.txt`.
The `main.py` script is the main entry point of the program. Follow these instructions to run the script for the various models:
- MSMP-J: set the `fast` argument of the `apply_clustering` function to `False`
- MSMP-Lite: set the `fast` argument of the `apply_clustering` function to `True`
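The mapping from model name to flag value can be written down as a small lookup. The helper `fast_flag` below is hypothetical and not part of the repository; it only encodes the two settings listed above, and the remaining arguments of `apply_clustering` are not shown because they are not documented here.

```python
# Hypothetical helper, not part of the repository: maps a model name to the
# value of the `fast` argument described above.
FAST_BY_MODEL = {"MSMP-J": False, "MSMP-Lite": True}


def fast_flag(model):
    """Return the `fast` value for a model name, or raise for unknown names."""
    try:
        return FAST_BY_MODEL[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


print(fast_flag("MSMP-Lite"))
```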
To use the same hash function as the original MSMP method, pass `''` for the `separator` argument to the `lsh` function.
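One way to see why the separator matters is sketched below. It assumes that the bucket key for a band is built by joining the band's integer values into a string; this is an assumption about how the `lsh` function works, and `band_key` is a hypothetical helper rather than code from the repository.

```python
def band_key(band_values, separator):
    """Join the integer values of one signature band into a single bucket key."""
    return separator.join(str(v) for v in band_values)


# With an empty separator, two different bands can yield the same key ("123").
collides_without_separator = band_key([1, 23], "") == band_key([12, 3], "")
# With a non-empty separator the keys stay distinct ("1-23" vs "12-3").
collides_with_separator = band_key([1, 23], "-") == band_key([12, 3], "-")
print(collides_without_separator, collides_with_separator)
```

Under that assumption, an empty separator can place products with different band values in the same bucket, so results obtained with `''` may include extra candidate pairs compared to a non-empty separator.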