Clustering, sometimes called unsupervised learning, is one of the most fundamental steps in understanding a dataset. It aims to discover the unknown nature of the data by separating a finite dataset, with little or no ground truth, into a finite, discrete set of "natural", hidden data structures. Given a set of n points in a multidimensional space, the purpose of clustering is to group them into several sets based on similarity or distance measures.
Most clustering algorithms have a complexity that is quadratic in n, making them inadequate for analyzing large amounts of data. In addition, many clustering algorithms are inherently difficult to parallelize.
PatchWork is a novel clustering algorithm designed to address those issues. It is a hybrid density- and grid-based clustering algorithm, with linear complexity and near-linear horizontal scalability. As a result, PatchWork can cluster a billion points in only a few minutes, a 40x improvement over the native Spark MLLib implementation of the well-known k-means.
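For illustration, the hypothetical Scala/Spark sketch below shows the general density-grid idea: map each point to a grid cell of side epsilon, count the points per cell in a single pass, and keep the cells holding at least minPts points. This is only a sketch of the technique, not the actual PatchWork implementation; all names and values are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal density-grid sketch (not the actual PatchWork code).
object GridSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("grid-sketch").setMaster("local[*]"))

    val eps    = 0.2   // cell size (one value per dimension in the real algorithm)
    val minPts = 1     // minimum number of points for a cell to be considered dense

    // Toy 2-D dataset; in practice the points would be loaded from HDFS.
    val points = sc.parallelize(Seq((0.1, 0.1), (0.15, 0.12), (5.0, 5.0)))

    // Single pass: map each point to its grid cell, count points per cell, keep dense cells.
    val denseCells = points
      .map { case (x, y) => ((math.floor(x / eps).toInt, math.floor(y / eps).toInt), 1) }
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minPts }

    denseCells.collect().foreach(println)
    sc.stop()
  }
}
```

Because the grid assignment and the per-cell counts are computed in one map/reduce pass, this kind of approach scales linearly with the number of points, which is the property highlighted above.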
We evaluated and compared PatchWork against the Spark MLLib implementation of the k-means|| algorithm, as well as Spark_DBSCAN.
We evaluated PatchWork using four synthetic datasets commonly used to evaluate clustering algorithms: Jain, Spiral, Aggregation and Compound. These datasets simulate different data distributions, but their size is very limited and not suitable for big-data applications. Hence, for each of these four datasets, we generated datasets of increasing size, up to 1.2 billion entries. The generated datasets have the same data distribution as the originals while allowing us to test the scalability of the algorithms.
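As an illustration only, the hypothetical Scala snippet below shows one simple way such larger datasets could be produced: replicating each original point many times with a small Gaussian jitter, so the overall distribution is preserved. This is an assumption for the sake of example, not necessarily the exact procedure used for the paper.

```scala
import scala.util.Random

// Hypothetical dataset-scaling sketch: replicate each point with small Gaussian jitter.
object ScaleDataset {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val original = Seq((0.3, 0.7), (1.2, 0.9), (2.5, 2.4)) // stand-in for e.g. the Jain dataset
    val replicationFactor = 1000                           // grow the dataset by this factor
    val sigma = 0.01                                       // jitter amplitude

    val enlarged = original.flatMap { case (x, y) =>
      Seq.fill(replicationFactor)((x + rng.nextGaussian() * sigma,
                                   y + rng.nextGaussian() * sigma))
    }
    println(s"generated ${enlarged.size} points from ${original.size} originals")
  }
}
```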
The parameters of each algorithm were empirically tuned for each dataset. The results of these experiments are shown below:
We benchmarked the performance of PatchWork and other algorithms using a cluster of six commodity servers, each with an Intel Xeon CPU E5-2650 processor with 8 cores at 2.60GHz, 192GB of RAM and 30TB of storage. The servers are interconnected using 10 Gigabit Ethernet. Experiments were conducted using HDFS 2.6.0 and Apache Spark 1.3.0 packaged in the Cloudera CDH 5.4.0 distribution.
The figure below shows the running times of the three algorithms when clustering up to 1 billion data points. Both the X and Y axes are logarithmic, with the Y axis showing the running time in seconds. The quadratic complexity of DBSCAN significantly impacts its running time: for large datasets with over 10 million entries, Spark-DBSCAN was unable to terminate in a timely fashion and is therefore not shown. In fact, in half the time, PatchWork could cluster 1 billion points while Spark-DBSCAN could only cluster 1.2 million points, a 1000-fold improvement.
Both PatchWork and k-means have a linear computational complexity. However, for very large datasets, PatchWork is significantly faster than the native Spark implementation of k-means: when clustering 1 billion points, PatchWork is 40 times faster.
Both MLLib k-means and PatchWork have a near-linear horizontal scalability, a critical property of distributed algorithms for big-data applications:
More details are available in the following technical paper:
Frank Gouineau, Tom Landry and Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31st ACM Symposium On Applied Computing, Data-Mining track, Pisa, Italy (accepted).
To run the demo, at the root of the project:
./runPatchwork.sh
The demo reads data from a sample file, runs PatchWork to compute the clusters, displays the cluster of each data point, and then prints a summary of the clusters. The demo runs in a few seconds. If everything runs smoothly, you will see something like:
-----------------------------------------
number of points : 7826
number of clusters : 5
-----------------------------------------
cluster 1 has 1370 cells
cluster 6 has 689 cells
cluster 14 has 193 cells
cluster 19 has 117 cells
cluster 25 has 299 cells
-----------------------------------------
size of epsilon : [0.2,0.2]
min pts in each cell : 1
time of training : 13450 ms
-----------------------------------------
Tests were executed on a Cloudera CDH 5.4.0 distribution featuring Spark 1.3.0.
By default, you need a running local Spark server. If you want to run the algorithm on a cluster (e.g. on a YARN cluster), you can edit the SparkContext and the runPatchwork.sh script, then recompile the project. sbt is required to compile the project.
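For example, here is a minimal sketch of how the SparkContext could be constructed to target YARN instead of a local master, using the Spark 1.3-era API. The object name is hypothetical and this is not the project's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: build the SparkContext against a YARN cluster instead of a local master.
object YarnExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("PatchWork")
      .setMaster("yarn-client")   // Spark 1.3-era master URL for running on YARN
    val sc = new SparkContext(conf)
    // ... run PatchWork with this context ...
    sc.stop()
  }
}
```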
To compile (at the root of the project): ./build.sh
For all questions, contact me at [email protected].