This repo demonstrates common big data analytics algorithms in Python. The examples are based on problems from Stanford's CS246 course.
Some of the examples require the Apache Spark API to parallelize the workload in MapReduce style.
- Python 2
- Apache Spark 2.4
- NumPy
- Pandas
No | Description | Spark |
---|---|---|
1 | Friend recommendation by mining social-network graphs | ✔️ |
2 | A-priori Algorithm: mining baskets for frequent itemsets | |
3 | Locality-sensitive Hashing: finding similar items | |
4 | K-means Clustering | ✔️ |
5 | Dimensionality Reduction: principal component analysis, CUR decomposition | |
6 | Collaborative Filtering: mining ratings database for movie recommendation | |
7 | PageRank | ✔️ |
8 | Girvan-Newman Algorithm: community detection in social-network graphs | ✔️ |
9 | Support Vector Machine | |
10 | Deep Learning | |
11 | DGIM Algorithm: mining a continuous stream of data | |
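
To give a feel for the MapReduce style that the Spark-based examples rely on, here is a minimal sketch of friend recommendation (row 1) in plain Python, without Spark. It is not the code from this repo: the function names (`map_phase`, `reduce_phase`, `recommend`) and the toy graph are hypothetical, and the graph is assumed to be undirected with every user present as a key.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user, friends):
    """Emit ((a, b), 1) for each pair of `user`'s friends.

    Every such pair shares `user` as a mutual friend.
    """
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1


def reduce_phase(pairs):
    """Sum the emitted values per key, like a reduce-by-key step."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def recommend(graph):
    """Return {user: [candidates]} ranked by mutual-friend count.

    `graph` maps each user to the set of that user's friends
    (hypothetical input format, assumed symmetric).
    """
    emitted = (
        kv
        for user, friends in graph.items()
        for kv in map_phase(user, friends)
    )
    counts = reduce_phase(emitted)

    recs = defaultdict(list)
    for (a, b), n in counts.items():
        if b in graph.get(a, ()):
            continue  # already friends, nothing to recommend
        recs[a].append((b, n))
        recs[b].append((a, n))

    # Rank by mutual-friend count (descending), then by user id.
    return {
        user: [v for v, _ in sorted(cands, key=lambda t: (-t[1], t[0]))]
        for user, cands in recs.items()
    }
```

For example, with `graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}`, users 1 and 4 are not friends but share two mutual friends (2 and 3), so each is recommended to the other. In a Spark version, the map and reduce steps would typically correspond to RDD transformations such as `flatMap` and `reduceByKey`.
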
- Upgrade to Python 3
- Pandas support