
# Sentence Classification

This project classifies sentences according to the specific book/novel they belong to.

1. The task is to classify each sentence into a specific category, i.e. a standard text classification task.

2. The input/training data is obfuscated: each sentence is a continuous sequence of characters, so the usual NLP pipeline (tokenisation, lemmatisation, stemming and stop-word removal) is not directly applicable here.

2.1 The data is split into training and validation sets in an 80%-20% ratio, as sketched below.
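A minimal sketch of this split, assuming the sentences and labels have been read from text files (the file names below are only illustrative):

```python
# Minimal sketch of the 80%/20% train/validation split; file names are illustrative.
from sklearn.model_selection import train_test_split

with open('xtrain.txt') as f:   # one obfuscated sentence per line
    sentences = [line.strip() for line in f]
with open('ytrain.txt') as f:   # one book/novel label per line
    labels = [line.strip() for line in f]

xtrain, xvalid, ytrain, yvalid = train_test_split(
    sentences, labels, test_size=0.2, random_state=42)
```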

3. Before using "deep" neural networks it is always good practice to check the performance of classical models such as Multinomial Naive Bayes (MNB), Simple Logistic Regression (SLR) and Support Vector Machines (SVM) on the given data.

4. To move ahead with the classical models, the data first needs to be processed and mapped into features. The methods employed are TfidfVectorizer (Term Frequency-Inverse Document Frequency) and CountVectorizer from sklearn.feature_extraction.text. Since the data is obfuscated, individual characters are treated as terms/tokens and the counts and frequencies are computed at the character level (see the sketch below).
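A minimal sketch of this step, assuming xtrain and xvalid are the sentence lists from the split above; the character n-gram range is an illustrative choice, not one stated in the project:

```python
# Minimal sketch of character-level TF-IDF and count features.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# analyzer='char' makes individual characters (and character n-grams) the tokens.
tfv = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), sublinear_tf=True)
ctv = CountVectorizer(analyzer='char', ngram_range=(1, 3))

xtrain_tfv = tfv.fit_transform(xtrain)   # fit on the training sentences only
xvalid_tfv = tfv.transform(xvalid)

xtrain_ctv = ctv.fit_transform(xtrain)
xvalid_ctv = ctv.transform(xvalid)
```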

5. Once these features are extracted, they are used to train the models below. For each model the validation log loss is reported, and the accuracy percentage is derived from it as accuracy = e^(-log loss). A minimal training sketch is given after item 5.4.

5.1 Simple Logistic Regression on TF-IDF features: log loss 0.999, accuracy 36.84%

5.2 Simple Logistic Regression on CountVectorizer features: log loss 0.749, accuracy 47.30%

5.3 Multinomial Naive Bayes on TF-IDF features: log loss 1.473, accuracy 22.91%

5.4 Multinomial Naive Bayes on CountVectorizer features: log loss 5.039, accuracy 0.64%
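A minimal sketch of these runs, reusing the features from the sketch above; the hyperparameters shown are plain defaults, not necessarily the ones used in the project:

```python
# Minimal sketch of fitting the classical models and reporting log loss plus the
# derived accuracy (e^-logloss) used throughout this README.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss

def evaluate(clf, xtr, xva, ytr, yva):
    clf.fit(xtr, ytr)
    loss = log_loss(yva, clf.predict_proba(xva))
    print('logloss: %.3f  accuracy: %.2f%%' % (loss, 100 * np.exp(-loss)))

evaluate(LogisticRegression(C=1.0), xtrain_tfv, xvalid_tfv, ytrain, yvalid)  # 5.1
evaluate(LogisticRegression(C=1.0), xtrain_ctv, xvalid_ctv, ytrain, yvalid)  # 5.2
evaluate(MultinomialNB(), xtrain_tfv, xvalid_tfv, ytrain, yvalid)            # 5.3
evaluate(MultinomialNB(), xtrain_ctv, xvalid_ctv, ytrain, yvalid)            # 5.4
```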

6. Next, Singular Value Decomposition (SVD, with 120 components) is applied before fitting an SVM. The data is scaled after SVD and then used to fit the SVM: log loss 1.087, accuracy 33.71% (see the sketch below).
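A minimal sketch of this step, reusing the TF-IDF features and the evaluate helper from the sketches above:

```python
# Minimal sketch of SVD (120 components) -> scaling -> SVM.
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svd = TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

scaler = StandardScaler()
xtrain_svd_scl = scaler.fit_transform(xtrain_svd)
xvalid_svd_scl = scaler.transform(xvalid_svd)

# probability=True is needed so the SVM exposes predict_proba for the log-loss metric.
evaluate(SVC(C=1.0, probability=True), xtrain_svd_scl, xvalid_svd_scl, ytrain, yvalid)
```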

7. Next, Extreme Gradient Boosting (xgboost) is trained on the features extracted above; a sketch is given after item 7.3.

7.1 Simple xgboost on TF-IDF features: log loss 0.772, accuracy 46.19%

7.2 Simple xgboost on TF-IDF SVD features: log loss 1.245, accuracy 28.79%

7.3 Simple xgboost on CountVectorizer features: log loss 0.740, accuracy 47.71%
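A minimal sketch of these runs, reusing the features and the evaluate helper from the sketches above; the xgboost hyperparameters are illustrative, and depending on the xgboost version the string labels may need to be integer-encoded first:

```python
# Minimal sketch of the xgboost runs (7.1-7.3).
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Integer-encode the book labels; some xgboost versions require numeric targets.
le = LabelEncoder()
ytrain_le = le.fit_transform(ytrain)
yvalid_le = le.transform(yvalid)

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, learning_rate=0.1,
                        subsample=0.8, colsample_bytree=0.8)

evaluate(clf, xtrain_tfv, xvalid_tfv, ytrain_le, yvalid_le)  # 7.1 TF-IDF features
evaluate(clf, xtrain_svd, xvalid_svd, ytrain_le, yvalid_le)  # 7.2 TF-IDF SVD features
evaluate(clf, xtrain_ctv, xvalid_ctv, ytrain_le, yvalid_le)  # 7.3 CountVectorizer features
```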

8. Next, grid search is used to find optimal hyperparameters for the models above, using the GridSearchCV module from sklearn.model_selection (a minimal sketch follows).
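A minimal sketch, shown here for logistic regression on the count features; the parameter grid is illustrative:

```python
# Minimal sketch of hyperparameter search with GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(), param_grid,
                    scoring='neg_log_loss', cv=3, verbose=1)
grid.fit(xtrain_ctv, ytrain)

print('Best log loss: %.3f' % -grid.best_score_)
print('Best parameters: %s' % grid.best_params_)
```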

9. Once these models are trained, it is observed that the classification accuracy is far from ideal, peaking at 47.71% in 7.3 (xgboost on CountVectorizer) and 47.30% in 5.2 (SLR on CountVectorizer).

10. We now move to multi-layer perceptrons (MLPs) for the classification task. To begin with, the data is processed and mapped into a vector space using GloVe features (Global Vectors for Word Representation, https://nlp.stanford.edu/projects/glove/). Here again we work at the character level instead of the word level.

Stack used: Python 2.7, TensorFlow, Keras

11. We load the GloVe vectors into a dictionary (embeddings_index) and map the training sentences to vectors with a sent2vec helper function, producing the xtrain_glove and xvalid_glove representations of the data. These vectors are scaled before being fed to a neural network, and the training labels (the ytrain.txt data) are binarized (one-hot encoded). A sample then looks like this (also shown in the shared Jupyter notebook); a minimal sketch of the loading/mapping step follows:

xtrain_glove[0] = [-0.0327601 0.07798596 -0.08264437 ..., -0.00731602 0.01921969 0.04853879]

ytrain[0] = [0 0 0 0 0 0 1 0 0 0 0 0]
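A minimal sketch of this step; the GloVe file name/dimension and the character-averaging inside sent2vec are assumptions based on the description above:

```python
# Minimal sketch: load GloVe vectors and map each character sequence to one vector.
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer

embeddings_index = {}
with open('glove.6B.300d.txt') as f:      # file name/dimension are illustrative
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

def sent2vec(s):
    """Average the GloVe vectors of the individual characters in a sentence."""
    vectors = [embeddings_index[ch] for ch in s if ch in embeddings_index]
    if not vectors:
        return np.zeros(300)
    return np.mean(np.array(vectors), axis=0)

xtrain_glove = np.array([sent2vec(s) for s in xtrain])
xvalid_glove = np.array([sent2vec(s) for s in xvalid])

# Scale the sentence vectors and one-hot encode the labels (12 books -> 12 columns).
scaler = StandardScaler()
xtrain_glove_scl = scaler.fit_transform(xtrain_glove)
xvalid_glove_scl = scaler.transform(xvalid_glove)

lb = LabelBinarizer()
ytrain_enc = lb.fit_transform(ytrain)
yvalid_enc = lb.transform(yvalid)
```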

12. Next, this data is fed to a simple sequential neural network: two blocks of ReLU-activated dense layers with dropout and batch normalization, followed by a softmax layer with 12 outputs. The network uses loss='categorical_crossentropy' and optimizer='adam'. It starts training quickly, and in around 16 epochs with a batch size of 64 the validation loss reaches 1.72. If training continues, the training loss keeps decreasing while the validation loss fluctuates and starts increasing again, showing that the model is starting to overfit. A minimal sketch of this architecture follows.
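A minimal sketch, using the scaled GloVe vectors and one-hot labels from the sketch above; layer widths and dropout rates are illustrative:

```python
# Minimal sketch of the sequential MLP: (Dense + ReLU + Dropout + BatchNorm) x 2 -> softmax(12).
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

model = Sequential()
model.add(Dense(300, activation='relu', input_dim=300))
model.add(Dropout(0.3))
model.add(BatchNormalization())
model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())
model.add(Dense(12, activation='softmax'))   # one output per book/novel

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain_glove_scl, ytrain_enc, batch_size=64, epochs=16,
          validation_data=(xvalid_glove_scl, yvalid_enc))
```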

13. Since the data is a sequence of characters, there should be contextual relations between the occurrences of different characters. This motivates the application of LSTMs (Long Short-Term Memory networks). The model is constructed in Keras.

We use the Keras Tokenizer at the character level (text.Tokenizer(num_words=None, char_level=True)) to process the input sentences, creating a vocabulary of 26 characters. The vocabulary (word_index) looks like this:

{'a': 11, 'c': 23, 'b': 26, 'e': 5, 'd': 19, 'g': 17, 'f': 20, 'i': 9, 'h': 2, 'k': 12, 'j': 25, 'm': 3, 'l': 6, 'o': 24, 'n': 14, 'q': 13, 'p': 10, 's': 15, 'r': 16, 'u': 1, 't': 8, 'w': 7, 'v': 4, 'y': 21, 'x': 22, 'z': 18}

And a sample input sentence, xtrain[0], is mapped (after padding) to the sequence below; a minimal sketch of this tokenisation/padding step follows the example.

[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 11 8 7 11 3 1 6 1 2 13 17 1 6 11 3 6 16 3 4 5 18 1 2 13 4 12 16 10 3 6 5 8 7 1 6 23 9 8 7 15 12 1 2 6 5 3 4 8 7 11 3 1 6 1 2 9 7 9 7 5 14 1 2 6 16 4 9 3 4 13 4 12 16 1 2 1 6 5 14 11 3 1 6 1 2 13 17 13 4 8 7 4 9 3 4 9 7 1 2 8 7 11 3 1 6 1 2 1 6 13 4 12 16 5 14 11 3 23 9 8 7 1 2 4 9 10 3 10 3 13 4 1 2 15 12 9 7 12 16 10 3 19 20 1 2 6 16 4 9 3 4 15 12 4 9 12 16 10 3 13 4 1 2 15 12 3 4 17 18 5 14 6 5 1 2 13 4 3 4 11 3 1 6 1 2 1 6 5 14 11 3 1 6 1 2 13 4 6 5 8 7 8 7 4 9 10 3 10 3 17 18 6 5 5 14 11 3 1 2 8 7 11 3 1 6 1 2 8 7 6 5 8 7 19 20 1 2 9 7 12 16 22 5 6 5 5 14 8 7 22 5 1 2 10 3 13 4 1 2 8 7 9 7 3 4 11 3 19 20 1 2 10 12 5 18 8 7 11 3 1 6 1 2 4 9 3 4 1 2 13 4 8 7 3 12 10 3 10 3 6 5 6 16 1 2 17 18 8 7 8 7 15 12 1 2 8 7 6 16 12 16 10 3 6 16 1 2 10 3 1 6 1 2 13 4 5 14 1 2 8 7 21 10 6 5 10 3 22 5 1 2 5 14 1 2 11 3 21 10 12 16 13 4 1 2 11 3 1 6 3 4 19 20 1 2 13 4 15 12 5 14 8 7 11 3 6 5 8 7 6 16 6 16 10 3 9 7 1 2 8 7 11 3 1 6]
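A minimal sketch of this step; the padding length is an illustrative choice:

```python
# Minimal sketch of character-level tokenisation and padding with Keras.
from keras.preprocessing import text, sequence

token = text.Tokenizer(num_words=None, char_level=True)
token.fit_on_texts(list(xtrain) + list(xvalid))
word_index = token.word_index            # the 26-character vocabulary shown above

max_len = 400                            # illustrative; sequences are padded/truncated to this length
xtrain_pad = sequence.pad_sequences(token.texts_to_sequences(xtrain), maxlen=max_len)
xvalid_pad = sequence.pad_sequences(token.texts_to_sequences(xvalid), maxlen=max_len)
```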

14. We train a simple LSTM with GloVe embeddings and two dense layers, reaching a validation accuracy of around 62% in 260 epochs with a batch size of 512.

15. We train another, bidirectional LSTM with GloVe embeddings and two dense layers, but this network takes a long time to train, around 300 seconds per epoch. After training for 160 epochs it reaches an accuracy of 75.33%, making it the best-performing model so far. A minimal sketch follows.
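A minimal sketch of the bidirectional LSTM, reusing embeddings_index, word_index, the padded sequences and the one-hot labels from the sketches above; layer sizes and dropout rates are illustrative:

```python
# Minimal sketch: GloVe-initialised embedding layer -> bidirectional LSTM -> two dense layers.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

# One GloVe vector per character in the tokenizer vocabulary.
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for ch, i in word_index.items():
    vec = embeddings_index.get(ch)
    if vec is not None:
        embedding_matrix[i] = vec

model = Sequential()
model.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                    input_length=max_len, trainable=False))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(12, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain_pad, ytrain_enc, batch_size=512, epochs=160,
          validation_data=(xvalid_pad, yvalid_enc))
```

Replacing the Bidirectional wrapper with a plain LSTM (or GRU) layer gives the simpler recurrent models of items 14 and 16.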

16. We train a GRU (Gated Recurrent Unit) network with GloVe embeddings and two dense layers; again the network takes a long time to train, around 300 seconds per epoch. The maximum accuracy reached with this network is 30%.

17. Finally, we train a 1D convnet on the existing GloVe features/vectors, since training times are much better with convolutional networks. Inspired by "Implementing a CNN for Text Classification in TensorFlow": http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

18. This is a Sequential network consisting of an embedding layer built from the embedding_matrix computed earlier, followed by dropout and a Conv1D layer with 128 filters and a kernel size of 5, followed by max pooling. We stack three of these convolution/pooling blocks, followed by a ReLU-activated dense layer and finally a softmax with 12 outputs. The CNN reaches an accuracy of 64%. A minimal sketch follows.
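A minimal sketch of this convnet, reusing embedding_matrix, max_len and the padded data from the sketches above; the final global pooling, dense width and training settings are assumptions where the description leaves details open:

```python
# Minimal sketch of the 1D convnet: embedding -> dropout -> 3 x (Conv1D + pooling) -> ReLU -> softmax.
from keras.models import Sequential
from keras.layers import (Embedding, Dropout, Conv1D, MaxPooling1D,
                          GlobalMaxPooling1D, Dense)

model = Sequential()
model.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                    input_length=max_len, trainable=False))
model.add(Dropout(0.2))

model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())              # collapse to a fixed-size vector

model.add(Dense(128, activation='relu'))
model.add(Dense(12, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain_pad, ytrain_enc, batch_size=128, epochs=20,
          validation_data=(xvalid_pad, yvalid_enc))
```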

## How do I get set up?

  • Working installation of TensorFlow and Keras (version 1.3.X)
  • Other dependencies such as nltk, sklearn, pandas
  • Jupyter notebook
  • Python 2.7