
Data Analytics Pipeline using Apache Spark

Project Overview

The goal of this project is to develop a data pipeline capable of collecting data from the New York Times and storing it in Hadoop clusters. Features, i.e. words characterizing each category, were extracted, with only the top N most frequently occurring words of each category selected as features. The resulting feature matrix was fed into the classification step, where models were trained using Naive Bayes and a multilayer perceptron. Testing was performed on random articles and the confusion matrices were studied.

Data Pipeline

Flow chart

Data Collection :

How to run : run DataCollectionNYTimes.ipynb inside the data collection folder, providing the topic name you are interested in and the folder name in which to save the articles (also provide the API details).

For this project, articles were collected from the New York Times. The DataCollection.py script collects the data from NYTimes and saves the articles in separate directories based on the category of the article.
The topics are Sports, Business, Politics and Science.
2000 articles per category were collected.
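
The sketch below shows, under stated assumptions, how such a collection script could look. It queries the NYT Article Search API (api.nytimes.com/svc/search/v2/articlesearch.json); the helper name fetch_articles, the page count, and the saved fields are illustrative and not taken from DataCollection.py.

    import os
    import requests

    API_KEY = "YOUR_NYTIMES_API_KEY"  # assumption: your own NYTimes API key
    SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

    def fetch_articles(topic, out_dir, pages=10):
        """Hypothetical helper: query the Article Search API for `topic`
        and save each hit as a text file under out_dir."""
        os.makedirs(out_dir, exist_ok=True)
        for page in range(pages):
            resp = requests.get(SEARCH_URL,
                                params={"q": topic, "page": page,
                                        "api-key": API_KEY})
            resp.raise_for_status()
            for i, doc in enumerate(resp.json()["response"]["docs"]):
                # Save headline plus lead paragraph; the real script may
                # scrape the full article body instead.
                path = os.path.join(out_dir, f"{topic}_{page}_{i}.txt")
                with open(path, "w") as f:
                    f.write(doc["headline"]["main"] + "\n")
                    f.write(doc.get("lead_paragraph") or "")

    for topic in ["Sports", "Business", "Politics", "Science"]:
        fetch_articles(topic, os.path.join("data collection", topic))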

Feature Extraction :

How to run : spark-submit wordcount.py data\ collection/ Business Science Sports Politics

The input to this step is the path of the folder containing the data collected in the previous step. wordcount.py performs feature extraction, emitting only the features that are required; this matters because unwanted features reduce accuracy and slow down model training. The output is the feature matrix that forms the input to the classification step. The feature matrix can be found inside the output folder.
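
A minimal sketch of the top-N per-category word count behind wordcount.py, assuming one subdirectory of plain-text articles per category; TOP_N, the tokenizer, and the per-category output paths are assumptions, and the actual script additionally assembles the per-article feature matrix.

    import re
    import sys
    from pyspark import SparkContext

    TOP_N = 40  # assumption: how many frequent words each category contributes

    sc = SparkContext(appName="wordcount_sketch")
    data_dir, categories = sys.argv[1], sys.argv[2:]

    for category in categories:
        top_words = (sc.textFile(f"{data_dir}/{category}/*")
                       .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .takeOrdered(TOP_N, key=lambda wc: -wc[1]))
        # The TOP_N most frequent words of this category become features.
        sc.parallelize(top_words).saveAsTextFile(f"output/{category}")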

Multi class Classification :

Now that we have extracted the required features, the next step is to train a model to classify the data. Two models were trained and their test accuracies are reported below; a minimal training sketch follows the list.

  • Naive Bayes
    How to run : spark-submit naive_bayes.py output/part-00000 Train
    Test Accuracy : 91.51%

  • Multi-Layer Perceptron
    How to run : spark-submit multilayer_perceptron.py output/part-00000 Train
    Test Accuracy : 92.80%
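
As a sketch of what the training run might look like with Spark's RDD-based MLlib API: this assumes the feature matrix in output/part-00000 stores one comma-separated line per article with the label first, which may not match naive_bayes.py exactly. The multilayer perceptron would be trained analogously with pyspark.ml's MultilayerPerceptronClassifier on a DataFrame.

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="naive_bayes_sketch")

    def parse(line):
        # Assumed layout: label first, then the word-count features.
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile("output/part-00000").map(parse)
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    model = NaiveBayes.train(train, lambda_=1.0)  # Laplace smoothing
    pairs = test.map(lambda p: (model.predict(p.features), p.label))
    accuracy = pairs.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
    print(f"Test accuracy: {accuracy:.4f}")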

Testing:

In this phase, we randomly select a few articles and ask the model to classify them. The testing accuracy will vary with how well the model was trained in the previous step.

Save the input files in the input folder and run the Spark program articlefeatures.py to get the feature vectors of the files; the input folder and the feature file are given as input to this program.
How to run : spark-submit articlefeatures.py input/[TextFileName].txt features/part-00000
where [TextFileName].txt is the article and features/part-00000 is where the output is stored. We then run both algorithms and report the results.
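
A hedged sketch of the feature-vector step: count how often each learned feature word occurs in the new article, keeping the same feature order as in training. The "word,count" layout assumed for the features file is illustrative.

    import re
    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="articlefeatures_sketch")
    article_path, features_path = sys.argv[1], sys.argv[2]

    # Fixed feature vocabulary, in a stable order (assumed "word,count" lines).
    vocab = [line.split(",")[0] for line in sc.textFile(features_path).collect()]

    # Word counts for the single input article.
    counts = (sc.textFile(article_path)
                .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                .countByValue())

    vector = [counts.get(word, 0) for word in vocab]
    print(vector)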

Naive Bayes

How to run : spark-submit naive_bayes.py output/part-00000 Test classified/part-00000

Naive Bayes Confusion Matrix (figure in repository)

Multilayer Perceptron

How to run : spark-submit multilayer_perceptron.py output/part-00000 Test classified/part-00000

Multilayer Perceptron Confusion Matrix (figure in repository)
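
The confusion matrices above can be reproduced with Spark's MulticlassMetrics; the (prediction, label) pairs below are placeholders for the output of the Test runs.

    from pyspark import SparkContext
    from pyspark.mllib.evaluation import MulticlassMetrics

    sc = SparkContext(appName="confusion_matrix_sketch")

    # Placeholder (prediction, label) pairs; in the pipeline these come from
    # running the trained model over the test articles.
    pairs = sc.parallelize([(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)])
    metrics = MulticlassMetrics(pairs)
    print(metrics.confusionMatrix().toArray())  # rows: true labels, columns: predictions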
