In this project the task was to take Greek newspaper articles from the enet dataset and categorize them, using Natural Language Processing, into the following categories:
- Art
- Economy
- Greece
- Politics
- Sports
- World
The process followed to perform this task is illustrated in the image below:
By applying the preprocessing steps to our dataset and performing dimensionality reduction, we were able to detect the features (terms and bigrams) that are most important in determining the category an article belongs to. Those terms and bigrams are shown in the images below:
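As a rough illustration of this idea (not the exact pipeline used in this project), the sketch below extracts term and bigram TF-IDF features with scikit-learn and uses PCA to pick out the features that weigh most heavily on the leading component; the function name and its parameters are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def top_features(articles, n_components=100, n_top=20):
    """Return the terms/bigrams that weigh most heavily on the first
    principal component of the TF-IDF matrix of `articles`."""
    # Unigram and bigram TF-IDF features.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
    X = vectorizer.fit_transform(articles).toarray()

    # PCA for dimensionality reduction; cap components by the data size.
    pca = PCA(n_components=min(n_components, *X.shape))
    pca.fit(X)

    # Features with the largest absolute weight on the leading component.
    terms = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(np.abs(pca.components_[0]))[::-1]
    return terms[order[:n_top]]
```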
The classification accuracy of the different models, using PCA for dimensionality reduction, is shown below:
In order to run this project you need tensorflow 1.8, numpy and matplotlib. After installing those packages, there are two scripts you can run:
- `run_single_model.py`, which performs dimensionality reduction using LDA or PCA, applies the corresponding data transformation, and then performs classification using one of the following models (a minimal pipeline sketch follows this list):
  - NB
  - SVM
  - RandomForest
  - GMM
  - KNN
  - ANN
  - CNN
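Purely as an illustration of what such a pipeline looks like (this is not the script itself), the sketch below combines one dimensionality-reduction step (PCA or LDA) with one of the listed classifiers (SVM) using scikit-learn; `X` and `y` in the commented usage are assumed to be a precomputed dense feature matrix and its category labels:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(reduction="pca", n_components=100):
    """Dimensionality reduction (PCA or LDA) followed by an SVM classifier."""
    if reduction == "pca":
        reducer = PCA(n_components=n_components)
    else:
        # LDA yields at most (n_classes - 1) components.
        reducer = LinearDiscriminantAnalysis()
    return Pipeline([("reduce", reducer), ("clf", SVC(kernel="linear"))])


# Hypothetical usage: X is a dense feature matrix, y the category labels.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# model = build_pipeline("pca").fit(X_train, y_train)
# print(accuracy_score(y_test, model.predict(X_test)))
```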
- `model_param_plot.py`, which has the following options:
  - Visualize bigrams: Creates a word-cloud visualization of the most important bigrams.
  - Visualize terms: Creates a word-cloud visualization of the most important terms.
  - Run model grid param search: For every model (GMM, MEAN, RandomForest, SVM, KNN) it evaluates and displays the model's accuracy for different sets of parameters.
  - Run all models kfold: Runs every model using k-fold cross-validation and displays the accuracies in a common figure (a sketch of such an evaluation follows this list).
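As an illustrative sketch of what the "run all models kfold" option does conceptually (again, not the script itself), the following evaluates a few of the listed models with scikit-learn's k-fold cross-validation and plots their mean accuracies; `X` and `y` stand for a hypothetical non-negative feature matrix (e.g. TF-IDF) and the corresponding labels:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def compare_models_kfold(X, y, k=5):
    """Score several classifiers with k-fold cross-validation and plot
    their mean accuracies in a single bar chart."""
    models = {
        "NB": MultinomialNB(),
        "SVM": SVC(kernel="linear"),
        "RandomForest": RandomForestClassifier(n_estimators=100),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    # Mean accuracy over the k folds for each model.
    scores = {name: cross_val_score(model, X, y, cv=k).mean()
              for name, model in models.items()}

    plt.bar(list(scores), list(scores.values()))
    plt.ylabel("Mean accuracy (%d-fold CV)" % k)
    plt.title("Model comparison")
    plt.show()
    return scores
```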