Explanation and Methodology used -
1. The task is to assign each sentence to one of 12 categories, i.e. a text classification task.
2. Since the data is obfuscated and each sentence is a continuous sequence of characters, the usual NLP pipeline (tokenisation, lemmatization, stemming and stop-word removal) is not directly applicable here.
2.1 The data is divided into training and validation sets in an 80%/20% ratio.
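A minimal sketch of such an 80%/20% split in plain Python 3. The function name train_valid_split, the seed and the toy sentences are assumptions made for illustration; the original notebook may have used a library helper instead.

```python
import random


def train_valid_split(samples, labels, valid_fraction=0.2, seed=42):
    """Shuffle paired samples/labels and split them into train and validation parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_valid = int(round(len(indices) * valid_fraction))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    xtrain = [samples[i] for i in train_idx]
    ytrain = [labels[i] for i in train_idx]
    xvalid = [samples[i] for i in valid_idx]
    yvalid = [labels[i] for i in valid_idx]
    return xtrain, xvalid, ytrain, yvalid


# Toy data (hypothetical): ten obfuscated "sentences" with two categories.
sentences = ["abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij", "ijk", "jkl"]
categories = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
xtrain, xvalid, ytrain, yvalid = train_valid_split(sentences, categories)
```

With ten samples this leaves 8 for training and 2 for validation. A stratified split, which keeps the class proportions equal in both parts, would be a reasonable refinement that this sketch does not attempt.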
3. Before using "deep" neural networks, it is good practice to check the performance of classical models such as Multinomial Naive Bayes (MNB), simple Logistic Regression (SLR) and Support Vector Machines (SVM) on the given data.
4. To use classical models, the data first needs to be processed and mapped into features. The methods employed are TfidfVectorizer (Term Frequency-Inverse Document Frequency) and CountVectorizer, both available from sklearn.feature_extraction.text. Here characters are treated as individual terms/tokens, and the counts and frequencies are calculated at the character level.
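To make the character-level features concrete, here is a small plain Python 3 sketch that computes count vectors and TF-IDF vectors by hand. It is an illustration only, not the sklearn implementation: the helper names are hypothetical, and the smoothed idf with L2 normalisation is an assumption about the weighting used.

```python
import math
from collections import Counter


def char_count_features(sentences):
    """Return (vocabulary, rows) where each row holds per-character counts."""
    vocab = sorted(set("".join(sentences)))
    rows = []
    for sentence in sentences:
        counts = Counter(sentence)
        rows.append([counts.get(ch, 0) for ch in vocab])
    return vocab, rows


def tfidf_features(count_rows):
    """Weight counts by a smoothed idf and L2-normalise each row (assumed scheme)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many sentences each character appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + df[j])) + 1.0 for j in range(n_terms)]
    features = []
    for row in count_rows:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        features.append([w / norm for w in weighted])
    return features


# Toy sentences (hypothetical), each a continuous run of characters.
toy_sentences = ["aab", "abc", "ccc"]
vocab, counts = char_count_features(toy_sentences)
tfidf = tfidf_features(counts)
```

Here vocab is ['a', 'b', 'c'] and the first count row is [2, 1, 0]. Characters that occur in every sentence get a lower idf weight than rare ones, which is the main difference from raw counts.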
5. Once these features are extracted, they are used to train the models below, with the following log loss values -
5.1 Fitting a simple Logistic Regression on TFIDF:
logloss: 0.999
36.84% is the accuracy percentage (calculated as e^(-logloss); strictly, this is the geometric mean of the probability assigned to the true class, used here as a proxy for accuracy)
5.2 Fitting a simple Logistic Regression on CountVectorizer features:
logloss: 0.749
47.30% is the accuracy percentage
5.3 Fitting a simple Multinomial Naive Bayes on TFIDF:
logloss: 1.473
22.91% is the accuracy percentage
5.4 Fitting a simple Multinomial Naive Bayes on CountVectorizer:
logloss: 5.039
0.64% is the accuracy percentage
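The relation between the log loss values and the percentages above can be sketched in plain Python 3. The function names and the toy probabilities are assumptions for illustration; the conversion e^(-logloss) is the one stated in 5.1.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_labels)


def logloss_to_percentage(logloss):
    """Apply e^(-logloss) and express the result as a percentage."""
    return 100.0 * math.exp(-logloss)


# Toy predictions (hypothetical): the true classes receive 0.5 and 0.25.
y_true = [0, 1]
y_prob = [[0.5, 0.5], [0.75, 0.25]]
loss = multiclass_log_loss(y_true, y_prob)
percentage = logloss_to_percentage(loss)
```

In this toy case the result equals 100 * sqrt(0.5 * 0.25), i.e. the geometric mean of the true-class probabilities. Applying logloss_to_percentage to the rounded value 0.749 gives roughly 47.3, in line with 5.2; small differences from the reported figures come from rounding of the log loss.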
6. Next, singular value decomposition (SVD, with 120 components) is applied before fitting an SVM. The data is scaled after SVD and then used to fit the SVM.
logloss: 1.087
33.71% is the accuracy percentage
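The scaling step before the SVM can be illustrated in plain Python 3 as follows. This sketch assumes standardisation (zero mean, unit variance per column), which the text does not confirm; the function name and the toy matrix are hypothetical, and the SVD itself is not reproduced here.

```python
import math


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / float(n_rows) for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / float(n_rows)
        stds.append(math.sqrt(variance) or 1.0)  # constant columns stay at 0
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in rows]


# Toy matrix (hypothetical) standing in for the SVD output: 2 rows, 2 components.
svd_output = [[1.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(svd_output)
```

SVMs are sensitive to feature scale, so components with a large numeric range would otherwise dominate the kernel. The statistics should be computed on the training set only and then reused for the validation set.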
7. Next, Extreme Gradient Boosting (xgboost) is applied to the features extracted above.
7.1 Fitting a simple xgboost on TFIDF:
logloss: 0.772
46.19% is the accuracy percentage
7.2 Fitting a simple xgboost on TFIDF SVD features:
logloss: 1.245
28.79% is the accuracy percentage
7.3 Fitting a simple xgboost on CountVectorizer features:
logloss: 0.740
47.71% is the accuracy percentage
8. Next, grid search is used to look for optimal parameters for the models mentioned above, using GridSearchCV from sklearn.model_selection.
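The idea behind grid search can be sketched in plain Python 3: try every combination of candidate values and keep the best-scoring one. This is an illustration, not GridSearchCV itself, which additionally cross-validates each combination. The parameter names, candidate values and toy_score function are hypothetical.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, n_components):
    """Hypothetical stand-in for a validation score (higher is better)."""
    return -((C - 1.0) ** 2) - ((n_components - 120) ** 2) / 1000.0


grid = {"C": [0.1, 1.0, 10.0], "n_components": [60, 120, 180]}
best_params, best_score = grid_search(grid, toy_score)
```

In a real run score_fn would train a model with the given parameters and return, for example, the negative log loss on held-out data. The cost grows with the product of the list lengths, so grids are usually kept small.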
9. With these models done, it is observed that classification accuracy is not ideal, peaking at 47.71% in 7.3 (xgboost on CountVectorizer) and 47.30% in 5.2 (SLR on CountVectorizer).
10. Now we move to multi-layer perceptrons (MLPs) for the classification task. To begin with, the data is processed and mapped onto a vector space using GloVe (Global Vectors for Word Representation - https://nlp.stanford.edu/projects/glove/) features. Here again we work at the character level instead of the word level.
Stack used - Python 2.7, TensorFlow, Keras
11. We load the GloVe vectors into a dictionary (embeddings index) and then map the training sentences to vectors using the sent2vec function. This creates the xtrain_glove and xvalid_glove vector representations of the data, which are then scaled before being fed to a neural network. The training labels (ytrain.txt data) are binarized for the same purpose. After this, a sample looks like (also shown in the shared Jupyter notebook) -
xtrain_glove[0]
[-0.0327601 0.07798596 -0.08264437 ..., -0.00731602 0.01921969
0.04853879]
ytrain[0]
[0 0 0 0 0 0 1 0 0 0 0 0]
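A hedged sketch of the two steps in item 11, in plain Python 3. The body of sent2vec is not shown in this file, so the version here (sum the character vectors, then divide by the L2 norm) is an assumption, as are the toy embeddings and the binarize_label helper.

```python
import math


def sent2vec(sentence, embeddings_index, dim):
    """Sum the vectors of known characters and normalise to unit length (assumed)."""
    total = [0.0] * dim
    for ch in sentence:
        vec = embeddings_index.get(ch)
        if vec is None:
            continue  # skip characters without an embedding
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    if norm == 0.0:
        return total
    return [t / norm for t in total]


def binarize_label(label, num_classes=12):
    """One-hot encode a zero-based class index."""
    row = [0] * num_classes
    row[label] = 1
    return row


# Toy 2-dimensional embeddings (hypothetical); real GloVe vectors are much larger.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
sentence_vector = sent2vec("aab", toy_embeddings, dim=2)
label_row = binarize_label(6)
```

binarize_label(6) yields a row with the same shape as the ytrain[0] sample above, with a single 1 at index 6. Averaging or summing character vectors discards character order, which is one reason the sequence models tried later can do better.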
12. Next, this data is fed to a simple sequential neural network with a ReLU, dropout and BatchNormalization block stacked twice, followed by a softmax with 12 outputs. The network uses loss='categorical_crossentropy' and optimizer='adam'. It starts training quickly, and in around 16 epochs with a batch size of 64 the validation loss reaches 1.72. With continued training, the training loss keeps decreasing while the validation loss fluctuates and starts increasing again, showing that the model is starting to overfit.
(Results are also shown in the jupyter notebook shared)
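The overfitting pattern described in item 12 is what early stopping is designed to catch. The plain Python 3 sketch below shows the rule on a list of validation losses; it is an illustration only, since the text does not say early stopping was used, and the function name, patience value and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. Epochs are zero-based.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: falling to 1.72, then fluctuating and rising.
losses = [2.10, 1.90, 1.80, 1.72, 1.75, 1.74, 1.79, 1.85]
best_epoch, stop_epoch = early_stopping_epoch(losses)
```

For these values the best epoch is 3 and training would stop at epoch 6. In practice the weights from the best epoch are restored rather than those from the last one.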
13. Since the data is a sequence of characters, there is likely a contextual relation between the occurrences of different characters. This motivates the use of LSTMs (Long Short-Term Memory networks). The model is constructed in Keras.
We use the Keras Tokenizer (text.Tokenizer(num_words=None, char_level=True)) at the character level to process input sentences, creating a vocabulary of 26 characters. The resulting vocabulary (word_index) looks like -
{'a': 11, 'c': 23, 'b': 26, 'e': 5, 'd': 19, 'g': 17, 'f': 20, 'i': 9, 'h': 2, 'k': 12, 'j': 25, 'm': 3, 'l': 6, 'o': 24, 'n': 14, 'q': 13, 'p': 10, 's': 15, 'r': 16, 'u': 1, 't': 8, 'w': 7, 'v': 4, 'y': 21, 'x': 22, 'z': 18}
And a sample input sentence xtrain[0] gets mapped to (after padding) -
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 11 8 7 11 3
1 6 1 2 13 17 1 6 11 3 6 16 3 4 5 18 1 2 13 4 12 16 10 3 6
5 8 7 1 6 23 9 8 7 15 12 1 2 6 5 3 4 8 7 11 3 1 6 1 2
9 7 9 7 5 14 1 2 6 16 4 9 3 4 13 4 12 16 1 2 1 6 5 14 11
3 1 6 1 2 13 17 13 4 8 7 4 9 3 4 9 7 1 2 8 7 11 3 1 6
1 2 1 6 13 4 12 16 5 14 11 3 23 9 8 7 1 2 4 9 10 3 10 3 13
4 1 2 15 12 9 7 12 16 10 3 19 20 1 2 6 16 4 9 3 4 15 12 4 9
12 16 10 3 13 4 1 2 15 12 3 4 17 18 5 14 6 5 1 2 13 4 3 4 11
3 1 6 1 2 1 6 5 14 11 3 1 6 1 2 13 4 6 5 8 7 8 7 4 9
10 3 10 3 17 18 6 5 5 14 11 3 1 2 8 7 11 3 1 6 1 2 8 7 6
5 8 7 19 20 1 2 9 7 12 16 22 5 6 5 5 14 8 7 22 5 1 2 10 3
13 4 1 2 8 7 9 7 3 4 11 3 19 20 1 2 10 12 5 18 8 7 11 3 1
6 1 2 4 9 3 4 1 2 13 4 8 7 3 12 10 3 10 3 6 5 6 16 1 2
17 18 8 7 8 7 15 12 1 2 8 7 6 16 12 16 10 3 6 16 1 2 10 3 1
6 1 2 13 4 5 14 1 2 8 7 21 10 6 5 10 3 22 5 1 2 5 14 1 2
11 3 21 10 12 16 13 4 1 2 11 3 1 6 3 4 19 20 1 2 13 4 15 12 5
14 8 7 11 3 6 5 8 7 6 16 6 16 10 3 9 7 1 2 8 7 11 3 1 6]
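The mapping from characters to padded integer sequences can be reproduced by hand in plain Python 3, using the word_index shown above. This is a sketch rather than the Keras code: the helper name is hypothetical, and left-padding with zeros plus truncation from the front are assumptions chosen to match the leading zeros in the sample.

```python
# Vocabulary copied from the word_index shown above.
word_index = {'a': 11, 'c': 23, 'b': 26, 'e': 5, 'd': 19, 'g': 17, 'f': 20,
              'i': 9, 'h': 2, 'k': 12, 'j': 25, 'm': 3, 'l': 6, 'o': 24,
              'n': 14, 'q': 13, 'p': 10, 's': 15, 'r': 16, 'u': 1, 't': 8,
              'w': 7, 'v': 4, 'y': 21, 'x': 22, 'z': 18}


def texts_to_padded_sequences(texts, word_index, maxlen):
    """Map each character to its index and left-pad with zeros to maxlen."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for text in texts:
        ids = [word_index[ch] for ch in text if ch in word_index]
        ids = ids[-maxlen:]  # keep the last maxlen tokens if too long
        padded.append([0] * (maxlen - len(ids)) + ids)
    return padded


# The first ten non-zero tokens of the sample (15 11 8 7 11 3 1 6 1 2)
# decode to "satwamuluh" under this vocabulary.
encoded = texts_to_padded_sequences(["satwamuluh"], word_index, maxlen=12)
```

The index 0 is reserved for padding, which is why the vocabulary starts at 1. All sequences end up with the same length, as required for batching in the recurrent and convolutional models.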
14. We train a simple LSTM with GloVe embeddings and two dense layers, reaching a validation accuracy of around 62% in 260 epochs with a batch size of 512.
15. We train a bidirectional LSTM with GloVe embeddings and two dense layers, but this network takes a long time to train, around 300 seconds per epoch. It could NOT produce usable results in the given time.
16. We train a GRU (Gated Recurrent Unit) network with GloVe embeddings and two dense layers; again this network takes a long time to train, around 300 seconds per epoch. It could NOT produce usable results in the given time.
17. Finally, we train a 1D convnet on the existing GloVe features/vectors, as training times are much better with convolutional networks. Inspired by - Implementing a CNN for Text Classification in TensorFlow http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
18. This is a Sequential network consisting of an embedding layer based on the embedding_matrix computed earlier, followed by dropout, a Conv1D layer with 128 filters and a kernel size of 5, and max pooling. We stack 3 of these layers, followed by a ReLU and finally a softmax with 12 outputs.
This network reaches 63% validation accuracy in around 9 epochs, with a batch size of 64.
We take this as the final performance due to time constraints and generate the final ytest.txt file.
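To show what one convolution plus max-pooling block from item 18 does to a sequence, here is a single-channel sketch in plain Python 3. It is not the Keras model: the real layer has 128 filters over the embedding dimensions, and the pool size of 5, the 'valid' padding and the input length of 450 are assumptions not stated in the text.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding (cross-correlation)."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


def max_pool1d(sequence, pool_size):
    """Take the maximum over non-overlapping windows, dropping any remainder."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]


def stacked_output_length(input_length, kernel_size=5, pool_size=5, num_blocks=3):
    """Sequence length after num_blocks of (valid convolution, max pooling)."""
    length = input_length
    for _ in range(num_blocks):
        length = length - kernel_size + 1  # valid convolution
        length = length // pool_size       # non-overlapping max pooling
    return length


# Toy signal and kernel (hypothetical) to show the two operations.
features = conv1d_valid([1, 2, 3, 4, 5, 6], [1, 0, -1])
pooled = max_pool1d([1, 5, 2, 4, 3], 2)
final_length = stacked_output_length(450)
```

Under these assumptions a sequence of 450 tokens shrinks to 446, 89, 85, 17, 13 and finally 2 positions across the three blocks. Each convolution touches only a small window, so all positions can be processed in parallel, whereas a recurrent layer must step through the sequence in order; that is consistent with the shorter training times reported for the convnet.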
PS - More improvements could have been made and other architectures tried, but due to my already arranged Christmas travel plans I could only give this task a couple of days.
Thank you.
Rajveer Shringi
Sources/Motivations -
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
https://nlp.stanford.edu/projects/glove/
https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle