
Data Analytics Pipeline using Apache Spark

Project Overview

The goal of this project is to develop a data pipeline capable of collecting data from the New York Times and storing it in Hadoop clusters. Features, i.e. words characterizing each category, were extracted, with only the top N most frequently occurring words of each category selected as features. The resulting feature matrix was fed into the classification step, where models were trained using Naive Bayes and a multilayer perceptron. Testing was performed on random articles and the confusion matrices were studied.

Data Pipeline

Flow chart

Data Collection :

How to run : run DataCollectionNYTimes.ipynb inside the data collection folder, providing the topic name you are interested in and the folder name in which to save the articles (also provide the API details).

For this project, articles were collected from the New York Times. The DataCollection.py script collects the data from NYTimes and saves the articles in separate directories based on the category of the article.
The topics are Sports, Business, Politics and Science.
2000 articles per category were collected.
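
The sketch below shows, under stated assumptions, how such a collection script could look. It queries the NYT Article Search API (api.nytimes.com/svc/search/v2/articlesearch.json); the helper name fetch_articles, the page count, and the saved fields are illustrative and not taken from DataCollection.py.

    import os
    import requests

    API_KEY = "YOUR_NYTIMES_API_KEY"  # assumption: your own NYTimes API key
    SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

    def fetch_articles(topic, out_dir, pages=10):
        """Hypothetical helper: query the Article Search API for `topic`
        and save each hit as a text file under out_dir."""
        os.makedirs(out_dir, exist_ok=True)
        for page in range(pages):
            resp = requests.get(SEARCH_URL,
                                params={"q": topic, "page": page,
                                        "api-key": API_KEY})
            resp.raise_for_status()
            for i, doc in enumerate(resp.json()["response"]["docs"]):
                # Save headline plus lead paragraph; the real script may
                # scrape the full article body instead.
                path = os.path.join(out_dir, f"{topic}_{page}_{i}.txt")
                with open(path, "w") as f:
                    f.write(doc["headline"]["main"] + "\n")
                    f.write(doc.get("lead_paragraph") or "")

    for topic in ["Sports", "Business", "Politics", "Science"]:
        fetch_articles(topic, os.path.join("data collection", topic))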

Feature Extraction :

How to run : spark-submit wordcount.py data\ collection/ Business Science Sports Politics

The input to this step is the path of the folder containing the data collected in the previous step. wordcount.py performs feature extraction, emitting only the features that are required; this matters because unwanted features reduce accuracy and slow down model training. The output is the feature matrix that forms the input to the classification step. The feature matrix can be found inside the output folder.
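
A minimal sketch of the top-N per-category word count behind wordcount.py, assuming one subdirectory of plain-text articles per category; TOP_N, the tokenizer, and the per-category output paths are assumptions, and the actual script additionally assembles the per-article feature matrix.

    import re
    import sys
    from pyspark import SparkContext

    TOP_N = 40  # assumption: how many frequent words each category contributes

    sc = SparkContext(appName="wordcount_sketch")
    data_dir, categories = sys.argv[1], sys.argv[2:]

    for category in categories:
        top_words = (sc.textFile(f"{data_dir}/{category}/*")
                       .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .takeOrdered(TOP_N, key=lambda wc: -wc[1]))
        # The TOP_N most frequent words of this category become features.
        sc.parallelize(top_words).saveAsTextFile(f"output/{category}")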

Multi class Classification :

Now that we have extracted the required features, the next step is to train a model to classify the data. Two models were trained and their test accuracies are reported below; a minimal training sketch follows the list.

  • Naive Bayes
    How to run : spark-submit naive_bayes.py output/part-00000 Train
    Test Accuracy : 91.51%

  • Multi-Layer Perceptron
    How to run : spark-submit multilayer_perceptron.py output/part-00000 Train
    Test Accuracy : 92.80%
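
As a sketch of what the training run might look like with Spark's RDD-based MLlib API: this assumes the feature matrix in output/part-00000 stores one comma-separated line per article with the label first, which may not match naive_bayes.py exactly. The multilayer perceptron would be trained analogously with pyspark.ml's MultilayerPerceptronClassifier on a DataFrame.

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="naive_bayes_sketch")

    def parse(line):
        # Assumed layout: label first, then the word-count features.
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile("output/part-00000").map(parse)
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    model = NaiveBayes.train(train, lambda_=1.0)  # Laplace smoothing
    pairs = test.map(lambda p: (model.predict(p.features), p.label))
    accuracy = pairs.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
    print(f"Test accuracy: {accuracy:.4f}")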

Testing:

In this phase, we randomly select a few articles and ask the model to classify them. The testing accuracy will vary with how well the model was trained in the previous step.

Save the input files in the input folder and run the Spark program articlefeatures.py to get the feature vectors of the files; the input folder and the feature file are given as input to this program.
How to run : spark-submit articlefeatures.py input/[TextFileName].txt features/part-00000
where [TextFileName].txt is the article and features/part-00000 is where the output is stored. We then run both algorithms and report the results.
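
A hedged sketch of the feature-vector step: count how often each learned feature word occurs in the new article, keeping the same feature order as in training. The "word,count" layout assumed for the features file is illustrative.

    import re
    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="articlefeatures_sketch")
    article_path, features_path = sys.argv[1], sys.argv[2]

    # Fixed feature vocabulary, in a stable order (assumed "word,count" lines).
    vocab = [line.split(",")[0] for line in sc.textFile(features_path).collect()]

    # Word counts for the single input article.
    counts = (sc.textFile(article_path)
                .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                .countByValue())

    vector = [counts.get(word, 0) for word in vocab]
    print(vector)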

Naive Bayes

How to run : spark-submit naive_bayes.py output/part-00000 Test classified/part-00000

Naive Bayes Confusion Matrix (figure in repository)

Multilayer Perceptron

How to run : spark-submit multilayer_perceptron.py output/part-00000 Test classified/part-00000

Multilayer Perceptron Confusion Matrix (figure in repository)
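
The confusion matrices above can be reproduced with Spark's MulticlassMetrics; the (prediction, label) pairs below are placeholders for the output of the Test runs.

    from pyspark import SparkContext
    from pyspark.mllib.evaluation import MulticlassMetrics

    sc = SparkContext(appName="confusion_matrix_sketch")

    # Placeholder (prediction, label) pairs; in the pipeline these come from
    # running the trained model over the test articles.
    pairs = sc.parallelize([(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)])
    metrics = MulticlassMetrics(pairs)
    print(metrics.confusionMatrix().toArray())  # rows: true labels, columns: predictions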
