The goal of this project is to develop a data pipeline capable of collecting data from the New York Times and storing it in Hadoop clusters. Features (i.e., words) characterizing each category were extracted, keeping only the top N most frequently occurring words per category. The resulting feature matrix was fed into the classification step, where models were trained using Naive Bayes and a Multi-Layer Perceptron. Testing was performed on random articles and the confusion matrix was studied.
How to run : run DataCollectionNYTimes.ipynb inside the data collection folder,
providing the topic name you are interested in and the folder name in which to save the articles (also provide your API details)
For this project, articles were collected from the New York Times.
The DataCollection.py script collects the data from NYTimes and saves the articles in separate directories based on the category of each article.
The topics include Sports, Business, Politics and Science
2000 articles per category were collected
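For reference, the sketch below shows one way such a collection step might look. It uses the public NYT Article Search API, but the helper name, file layout, page counts, and text fields saved are illustrative assumptions, not the actual contents of DataCollection.py.

```python
import os
import time
import requests

# Sketch of the collection step, assuming the public NYT Article Search API.
# Helper name, directory layout, and saved fields are illustrative.
API_KEY = "YOUR_API_KEY"  # placeholder; supply your own key
SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def collect_category(category, out_dir, pages=10):
    os.makedirs(out_dir, exist_ok=True)
    for page in range(pages):
        resp = requests.get(SEARCH_URL, params={
            "fq": 'news_desk:("%s")' % category,  # filter by category/news desk
            "page": page,
            "api-key": API_KEY,
        })
        resp.raise_for_status()
        for i, doc in enumerate(resp.json()["response"]["docs"]):
            # Save the abstract/snippet text of each article to its own file.
            text = doc.get("abstract") or doc.get("snippet") or ""
            with open(os.path.join(out_dir, "%s_%d_%d.txt" % (category, page, i)), "w") as f:
                f.write(text)
        time.sleep(6)  # pause between requests to respect the API rate limit

for topic in ["Sports", "Business", "Politics", "Science"]:
    collect_category(topic, os.path.join("data collection", topic))
```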
How to run : spark-submit wordcount.py data\ collection/ Business Science Sports Politics
The input to this step is the path of the folder containing the data collected in the previous step. wordcount.py performs feature extraction, emitting only the features that are required. This is important because unwanted features reduce accuracy and increase training time. The output is the feature matrix that forms the input to our classification step; the feature matrix can be found inside the output folder.
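The core of this step is a per-category word count followed by a top-N cut. Below is a minimal sketch of that idea; the value of N, the tokenizer, and the output layout are assumptions, not the actual wordcount.py.

```python
import re
import sys
from pyspark import SparkContext

# Minimal sketch of top-N feature extraction: count words per category,
# keep the N most frequent of each, and union them into one feature set.
# N = 40 and the regex tokenizer are assumed values.
sc = SparkContext(appName="TopNFeatures")
N = 40

def top_words(path):
    counts = (sc.textFile(path)
                .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    return counts.takeOrdered(N, key=lambda kv: -kv[1])

base = sys.argv[1]              # e.g. "data collection/"
features = set()
for category in sys.argv[2:]:   # e.g. Business Science Sports Politics
    features.update(w for w, _ in top_words(base + category))

# One feature word per line, written as a single part file.
sc.parallelize(sorted(features), 1).saveAsTextFile("output")
```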
Now that we have extracted the required features, the next step is to train a model to classify the data. We used two models; each is trained and its test accuracy reported:
- Naive Bayes
How to run : spark-submit naive_bayes.py output/part-00000 Train
Test Accuracy : 91.51
- Multi-Layer Perceptron
How to run : spark-submit multilayer_perceptron.py output/part-00000 Train
Test Accuracy : 92.80
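As a reference for what the Train mode does, here is a hedged sketch using MLlib's NaiveBayes. It assumes the feature matrix is stored as comma-separated lines beginning with the label; the real file format and split ratio in naive_bayes.py may differ.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# Hedged sketch of the Train mode. Assumes each line of the feature matrix
# is "label,count1,count2,..." — an assumption, not the script's format.
sc = SparkContext(appName="TrainNaiveBayes")

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("output/part-00000").map(parse)
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = NaiveBayes.train(train, lambda_=1.0)  # additive smoothing
pred_and_label = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = pred_and_label.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
print("Test Accuracy : %.2f" % (accuracy * 100))
```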
In this phase, we randomly select a few articles and ask our model to classify them. The testing accuracy will vary based on how well the model was trained in the previous step.
Save the input files in the input folder and run the Spark program articlefeatures.py to get the feature vectors of the files; the input folder and the feature file are given as inputs to this program.
How to run : spark-submit articlefeatures.py input/[TextFileName].txt features/part-00000
where [TextFileName].txt is the article to classify and features/part-00000 is the feature file produced in the feature-extraction step
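Conceptually, this step counts how often each known feature word occurs in the article, producing one vector entry per feature. The sketch below illustrates that idea; the tokenizer and output format of the real articlefeatures.py are assumptions.

```python
import re
import sys
from pyspark import SparkContext

# Illustrative sketch of turning one article into a feature vector: count
# occurrences of each known feature word. Argument order matches the
# command above; the real output format may differ.
sc = SparkContext(appName="ArticleFeatures")

features = sc.textFile(sys.argv[2]).collect()  # features/part-00000, one word per line
counts = dict(sc.textFile(sys.argv[1])         # input/[TextFileName].txt
                .flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b)
                .collect())

vector = [counts.get(f, 0) for f in features]  # one count per feature word
print(",".join(str(v) for v in vector))
```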
We run both algorithms and report the results
How to run : spark-submit naive_bayes.py output/part-00000 Test classified/part-00000
How to run : spark-submit multilayer_perceptron.py output/part-00000 Test classified/part-00000
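Since the overview mentions studying the confusion matrix, the sketch below shows one way to compute it with MLlib's MulticlassMetrics. It reuses the same assumed "label,count1,..." file format as the training sketch above.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics

# Hedged sketch of the Test mode: train on the feature matrix, predict the
# vectors of the newly classified articles, and print the confusion matrix.
sc = SparkContext(appName="TestNaiveBayes")

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

train = sc.textFile("output/part-00000").map(parse)
test = sc.textFile("classified/part-00000").map(parse)

model = NaiveBayes.train(train, lambda_=1.0)
pred_and_label = test.map(lambda p: (model.predict(p.features), p.label))
print(MulticlassMetrics(pred_and_label).confusionMatrix().toArray())  # rows = actual, cols = predicted
```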