<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>François de Ryckel</title>
<link>/</link>
<atom:link href="/index.xml" rel="self" type="application/rss+xml" />
<description>François de Ryckel</description>
<generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sun, 07 Jun 2020 00:00:00 +0000</lastBuildDate>
<image>
<url>/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
<title>François de Ryckel</title>
<link>/</link>
</image>
<item>
<title>Example Page 1</title>
<link>/courses/example/example1/</link>
<pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
<guid>/courses/example/example1/</guid>
<description><p>In this tutorial, I&rsquo;ll share my top 10 tips for getting started with Academic:</p>
<h2 id="tip-1">Tip 1</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
<h2 id="tip-2">Tip 2</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
</description>
</item>
<item>
<title>Example Page 2</title>
<link>/courses/example/example2/</link>
<pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
<guid>/courses/example/example2/</guid>
<description><p>Here are some more tips for getting started with Academic:</p>
<h2 id="tip-3">Tip 3</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
<h2 id="tip-4">Tip 4</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.</p>
<p>Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.</p>
<p>Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.</p>
<p>Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.</p>
<p>Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.</p>
</description>
</item>
<item>
<title>Disaster Tweets - Part iii</title>
<link>/post/disaster-tweets-part-iii/</link>
<pubDate>Sun, 07 Jun 2020 00:00:00 +0000</pubDate>
<guid>/post/disaster-tweets-part-iii/</guid>
<description>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<pre class="r"><code>library(readr) # to read and write (import / export) any type into our R console.
library(dplyr) # for pretty much all our data wrangling
library(ggplot2)
library(stringr)
library(forcats)
library(purrr)
library(janitor) # to clear variable names with clean_names()</code></pre>
</div>
<div id="using-glove-embedding" class="section level1">
<h1>Using glove embedding</h1>
<p>GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<p>GloVe encodes the ratios of word-word co-occurrence probabilities, which is thought to represent some crude form of meaning associated with the abstract concept of the word, as vector difference. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.</p>
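<p>As a rough illustration of that objective, here is a toy sketch in base R. It is not the GloVe training code: the co-occurrence counts, the 2d vectors and the helper name <code>glove_loss()</code> are all made up, and the weighting function of the real model is left out. It only shows the quantity being minimised, the squared gap between dot products (plus bias terms) and log co-occurrences.</p>
<pre class="r"><code># Toy co-occurrence counts for 3 words (hypothetical numbers).
X &lt;- matrix(c(10, 4, 1,
               4, 8, 2,
               1, 2, 6), nrow = 3, byrow = TRUE)

set.seed(1)
W  &lt;- matrix(rnorm(3 * 2, sd = 0.1), nrow = 3)  # word vectors (2d)
Wc &lt;- matrix(rnorm(3 * 2, sd = 0.1), nrow = 3)  # context vectors (2d)
b  &lt;- rep(0, 3)                                  # word biases
bc &lt;- rep(0, 3)                                  # context biases

# Unweighted squared gap between (dot product + biases) and log co-occurrence.
glove_loss &lt;- function(W, Wc, b, bc, X) {
  pred &lt;- W %*% t(Wc) + outer(b, bc, &quot;+&quot;)
  sum((pred - log(X))^2)
}

glove_loss(W, Wc, b, bc, X)</code></pre>
<p>Training would adjust the vectors and biases to push this value down; here we only evaluate it once for random starting vectors.</p>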
<p>The workflow for vectorizing the tweet text into GloVe embeddings, adapted from <a href="https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/" class="uri">https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/</a>, is as follows:</p>
<ol style="list-style-type: decimal">
<li>Tokenize incoming tweet texts in the training data.</li>
<li>Download and parse glove embeddings into an embedding matrix for the tokenized words.</li>
<li>Generate embeddings vector for tweets text in training data.</li>
<li>Generate embeddings vector for tweets text in test data.</li>
<li>Append to given tweets features and export.</li>
</ol>
<p>We will not stem or lemmatize the tweets at first; this keeps most of the meaning carried by the words used.</p>
<pre class="r"><code>clean_tweets &lt;- function(df){
df &lt;- df %&gt;%
mutate(number_hashtag = str_count(string = text, pattern = &quot;#&quot;),
number_number = str_count(string = text, pattern = &quot;[0-9]&quot;) %&gt;% as.numeric(),
number_http = str_count(string = text, pattern = &quot;http&quot;) %&gt;% as.numeric(),
number_mention = str_count(string = text, pattern = &quot;@&quot;) %&gt;% as.numeric(),
number_location = if_else(!is.na(location), 1, 0),
number_keyword = if_else(!is.na(keyword), 1, 0),
number_repeated_char = str_count(string = text, pattern = &quot;([a-z])\\1{2}&quot;) %&gt;% as.numeric(),
text = str_replace_all(string = text, pattern = &quot;http[^[:space:]]*&quot;, replacement = &quot;&quot;),
text = str_replace_all(string = text, pattern = &quot;@[^[:space:]]*&quot;, replacement = &quot;&quot;),
number_char = nchar(text), #add the length of the tweet in character.
number_word = str_count(string = text, pattern = &quot;\\w+&quot;),
text = str_replace_all(string = text, pattern = &quot;[0-9]&quot;, replacement = &quot;&quot;),
text = future_map(text, function(.x) stringi::stri_trans_general(.x, &quot;Latin-ASCII&quot;)) %&gt;% unlist(.),
text = str_replace_all(string = text, pattern = &quot;\u0089&quot;, replacement = &quot;&quot;)) %&gt;%
select(-keyword, -location)
return(df)
}
library(furrr)
plan(&quot;multicore&quot;)
df_train &lt;- read_csv(&quot;~/disaster_tweets/data/train.csv&quot;) %&gt;% clean_tweets()
# some identical tweets carry different targets: we give each duplicated text its majority target
temp &lt;- df_train %&gt;% group_by(text) %&gt;%
mutate(mean_target = mean(target),
new_target = if_else(mean_target &gt; 0.5, 1, 0)) %&gt;% ungroup() %&gt;%
mutate(target = new_target,
target_bin = factor(if_else(target == 1, &quot;a_truth&quot;, &quot;b_false&quot;))) %&gt;%
select(-new_target, -mean_target, -target)
df_train &lt;- temp</code></pre>
<p>We use keras’ text_tokenizer to tokenize the text of the tweets dataset.</p>
<pre class="r"><code>library(keras)
# we assign an ID to each word of the whole tweets corpus
tokenizer &lt;- text_tokenizer() %&gt;% fit_text_tokenizer(df_train$text)
# to check how many different words are in the corpus.
# we add 1 because index 0 is reserved for padding (word indices start at 1).
num_words &lt;- length(tokenizer$word_index) + 1
# Using the fitted tokenizer above, we now convert all the texts into sequences of indices.
sequences &lt;- texts_to_sequences(tokenizer, df_train$text)
## how long is the longest tweet? 32 words! We can use that as the base for padding.
summary(map_int(sequences, length))</code></pre>
<pre><code>## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 13.00 13.64 18.00 32.00</code></pre>
<pre class="r"><code>max_tweet_length &lt;- max(map_int(sequences, length))
# now, we need to pad all the other tweets to a length of 32.
# by default the padding (zeros) comes first, then the text.
padded_sequences &lt;- pad_sequences(sequences = sequences, maxlen = max_tweet_length)
# checking that we do have a 7613 tweets x 32 columns matrix.
dim(padded_sequences) </code></pre>
<pre><code>## [1] 7613 32</code></pre>
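<p>To make the padding step concrete, here is a minimal base R sketch of what pre-padding does to a list of index sequences. The helper <code>pad_pre()</code> and the toy sequences are hypothetical and only mimic the default behaviour of <code>pad_sequences()</code> (zeros added in front, sequences that are too long cut from the front); this is not the keras implementation.</p>
<pre class="r"><code># Hypothetical helper mimicking &quot;pre&quot; padding on a list of integer sequences.
pad_pre &lt;- function(seqs, maxlen) {
  t(vapply(seqs, function(s) {
    s &lt;- tail(s, maxlen)               # keep the last maxlen indices if too long
    c(rep(0L, maxlen - length(s)), s)  # zeros first, then the word indices
  }, integer(maxlen)))
}

toy &lt;- list(c(5L, 2L), c(7L, 1L, 3L, 9L), 4L)
pad_pre(toy, maxlen = 4)</code></pre>
<p>Each row of the result is one sequence, right-aligned, with zeros filling the missing positions on the left.</p>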
<p>Let’s have a look at the first 5 tweets, their conversion into indices and their final padded form.</p>
<pre class="r"><code># the first 5 tweets in words
df_train$text[1:5]</code></pre>
<pre><code>## [1] &quot;Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all&quot;
## [2] &quot;Forest fire near La Ronge Sask. Canada&quot;
## [3] &quot;All residents asked to &#39;shelter in place&#39; are being notified by officers. No other evacuation or shelter in place orders are expected&quot;
## [4] &quot;, people receive #wildfires evacuation orders in California&quot;
## [5] &quot;Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school&quot;</code></pre>
<pre class="r"><code># the first 5 tweets in indices
sequences[1:5]</code></pre>
<pre><code>## [[1]]
## [1] 113 4389 20 1 830 5 18 247 135 1562 4390 84 36
##
## [[2]]
## [1] 184 42 215 764 6440 6441 1354
##
## [[3]]
## [1] 36 1690 1563 4 6442 3 6443 20 128 6444 17 1691 35 419 241
## [16] 53 2085 3 686 1355 20 1070
##
## [[4]]
## [1] 58 4391 1447 241 1355 3 91
##
## [[5]]
## [1] 30 92 1182 18 312 19 6445 2356 26 256 19 1447 6446 66 2
## [16] 179</code></pre>
<pre class="r"><code># And the first 5 tweets with padding
padded_sequences[1:5, ]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 36 1690 1563 4
## [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] 0 0 0 0 0 113 4389 20 1 830 5 18
## [2,] 0 0 0 0 0 0 0 0 0 0 0 184
## [3,] 6442 3 6443 20 128 6444 17 1691 35 419 241 53
## [4,] 0 0 0 0 0 0 0 0 0 0 0 58
## [5,] 0 0 30 92 1182 18 312 19 6445 2356 26 256
## [,27] [,28] [,29] [,30] [,31] [,32]
## [1,] 247 135 1562 4390 84 36
## [2,] 42 215 764 6440 6441 1354
## [3,] 2085 3 686 1355 20 1070
## [4,] 4391 1447 241 1355 3 91
## [5,] 19 1447 6446 66 2 179</code></pre>
<p>A total of 15092 unique words were assigned an index in the tokenization (num_words is that count plus one, to account for the padding index 0).</p>
<p>We borrow the code for parsing the GloVe file and generating the embedding matrix from Aditya Mangal’s blog<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, where it comes from his deepSentimentR package.</p>
<pre class="r"><code># read a GloVe text file (one word per line, followed by its vector) into an environment: word -&gt; numeric vector
parse_glove_embeddings &lt;- function(file_path) {
lines &lt;- readLines(file_path)
embeddings_index &lt;- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_along(lines)) {
line &lt;- lines[[i]]
values &lt;- strsplit(line, &quot; &quot;)[[1]]
word &lt;- values[[1]]
embeddings_index[[word]] &lt;- as.double(values[-1])
}
cat(&quot;Found&quot;, length(embeddings_index), &quot;word vectors.\n&quot;)
return(embeddings_index)
}
# build the matrix of word vectors: one row per word index, plus a first row for the padding index 0
generate_embedding_matrix &lt;- function(word_index, embedding_dim, max_words, glove_file_path) {
embeddings_index &lt;- parse_glove_embeddings(glove_file_path)
embedding_matrix &lt;- array(0, c(max_words, embedding_dim))
for (word in names(word_index)) {
index &lt;- word_index[[word]]
if (index &lt; max_words) {
embedding_vector &lt;- embeddings_index[[word]]
# words without a GloVe vector keep an all-zero row
if (!is.null(embedding_vector)) {
# index + 1 because row 1 of the R matrix corresponds to index 0 in keras
embedding_matrix[index+1,] &lt;- embedding_vector
}
}
}
return(embedding_matrix)
}</code></pre>
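<p>To see what these two functions produce without downloading the GloVe files, here is a small self-contained sketch on a made-up three-dimensional &quot;GloVe&quot; file. The file content, the toy vocabulary <code>toy_index</code> and the object names are all hypothetical; the sketch only reproduces the same logic (parse each line, then fill row <code>index + 1</code>) with a plain list instead of an environment.</p>
<pre class="r"><code># Write a tiny file in the GloVe text layout: word, then its vector, space separated.
toy_file &lt;- tempfile(fileext = &quot;.txt&quot;)
writeLines(c(&quot;fire 0.1 0.2 0.3&quot;,
             &quot;flood -0.4 0.5 0.6&quot;), toy_file)

# Hypothetical vocabulary: word -&gt; integer index, as a tokenizer would produce.
toy_index &lt;- list(fire = 1L, storm = 2L, flood = 3L)

# Parse the file into a named list of numeric vectors.
parsed &lt;- strsplit(readLines(toy_file), &quot; &quot;)
vectors &lt;- setNames(lapply(parsed, function(v) as.double(v[-1])),
                    vapply(parsed, `[`, character(1), 1))

# Row 1 stays all zeros for the padding index 0, hence the index + 1 offset.
toy_matrix &lt;- matrix(0, nrow = length(toy_index) + 1, ncol = 3)
for (word in names(toy_index)) {
  if (!is.null(vectors[[word]])) {
    toy_matrix[toy_index[[word]] + 1, ] &lt;- vectors[[word]]
  }
}
toy_matrix</code></pre>
<p>The word &quot;storm&quot; has no vector in the toy file, so its row stays at zero, exactly like the words of our tweets that GloVe does not know.</p>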
<p>The GloVe project provides word vectors trained on a Twitter dataset of 2B tweets and 27B tokens. They come in 25d, 50d, 100d or 200d versions.</p>
<p>We’ll try different variants and adjust depending on our results.</p>
<pre class="r"><code># To pick the length of each word vector
embedding_dim &lt;- 25
#embedding_dim &lt;- 50
# this operation is the crux of the whole numerization of our text.
# we basically assign a word vector to each word. Here we go with a 25d dense vector.
embedding_matrix &lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = embedding_dim, max_words = num_words,
&quot;~/glove/glove.twitter.27B.25d.txt&quot;)</code></pre>
<pre><code>## Found 1193514 word vectors.</code></pre>
<pre class="r"><code>#embedding_matrix &lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = 50, max_words = num_words,
# &quot;data_glove.twitter.27B/glove.twitter.27B.50d.txt&quot;)
# there are 15,092 different words in all the tweets (plus the padding index 0). Each of them is now a 25d vector,
# so we should have a matrix of dimension 15093 by 25
dim(embedding_matrix)</code></pre>
<pre><code>## [1] 15093 25</code></pre>
<pre class="r"><code>#Let&#39;s save that precious matrix for further use
#write_rds(x = embedding_matrix, path = &quot;data/embedding_matrix_50d.rds&quot;)
write_rds(x = embedding_matrix, path = &quot;~/disaster_tweets/data/embedding_matrix_25d.rds&quot;)</code></pre>
<p>We use the Keras modeling framework to generate the embeddings for the training data. We create a simple sequential model with one embedding layer, whose weights we set from the embedding matrix created above and then freeze, and a flattening layer that flattens the output into a 2D matrix of dimensions 7613 x (32x25) for 25d word vectors, or 7613 x (32x50) for 50d word vectors.</p>
<p>Remember that the longest tweet has 32 words and that each word is a 25d vector here. So we want to end up with a matrix of 7613 x 800 (32x25); with 50d vectors it would be 7613 x 1600 (32x50). For many tweets, the row is going to start with a bunch of zeros because of the padding, which comes first in our case.</p>
<p>So now we need to apply that embedding to each of the 7613 tweets. Keras will do that for us.</p>
<pre class="r"><code>embedding_matrix &lt;- read_rds(&quot;~/disaster_tweets/data/embedding_matrix_25d.rds&quot;)
#embedding_matrix &lt;- read_rds(&quot;data/embedding_matrix_50d.rds&quot;)
model_embedding &lt;- keras_model_sequential() %&gt;%
layer_embedding(input_dim = num_words, #number of total words in all of the tweets
output_dim = embedding_dim, #the length of our embedding vectors (25d in this case)
input_length = max_tweet_length, #the number of words of the longest tweet. All other tweets will be padded to have that length
name = &quot;embedding&quot;) %&gt;%
layer_flatten(name = &quot;flatten&quot;)
model_embedding %&gt;%
get_layer(name = &quot;embedding&quot;) %&gt;%
set_weights(list(embedding_matrix)) %&gt;%
freeze_weights()
tweets_embedding &lt;- model_embedding %&gt;% predict(padded_sequences)</code></pre>
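<p>For intuition, the combination of a frozen embedding layer and a flattening layer amounts to a table lookup followed by a concatenation. The base R sketch below reproduces that on made-up data; <code>toy_embedding</code>, <code>toy_padded</code> and the helper <code>embed_and_flatten()</code> are hypothetical names, and this is only meant to show the layout of the output, not how Keras computes it internally.</p>
<pre class="r"><code># Toy embedding matrix: row 1 is the padding row (index 0), rows 2-4 are words 1-3.
toy_embedding &lt;- rbind(c(0, 0),
                       c(1, 2),
                       c(3, 4),
                       c(5, 6))
# Two padded &quot;tweets&quot; of length 3 (0 = padding).
toy_padded &lt;- rbind(c(0L, 1L, 3L),
                    c(2L, 2L, 1L))

# Look up each index (+1 for R&#39;s 1-based rows) and lay the word vectors side by side.
embed_and_flatten &lt;- function(padded, embedding) {
  t(apply(padded, 1, function(idx) as.vector(t(embedding[idx + 1, , drop = FALSE]))))
}

embed_and_flatten(toy_padded, toy_embedding)</code></pre>
<p>Each row has length 3 x 2 = 6: the vector of the first position, then the second, then the third. With 32 positions and 25d vectors this gives the 800 columns of our real matrix.</p>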
<p>So, let’s make sense of what is happening. Each tweet is now 800 variables long (32 words x 25d). The first tweet was: [1] “Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all”. This tweet is 13 words long, so the last 325 variables (13 x 25) should be filled, while the first 475 should be 0s. Let’s check that.</p>
<pre class="r"><code>str(tweets_embedding)</code></pre>
<pre><code>## num [1:7613, 1:800] 0 0 0 0 0 0 0 0 0 0 ...</code></pre>
<pre class="r"><code># and part of the first tweet.
tweets_embedding[1, 450:500]</code></pre>
<pre><code>## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [8] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [15] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## [22] 0.000000 0.000000 0.000000 0.000000 0.000000 -0.420470 0.565260
## [29] -0.033577 0.310190 0.189300 -0.645880 1.387600 -0.574840 -0.138960
## [36] -0.390030 -0.169110 -0.073094 -5.702100 0.812640 -0.412840 -0.438670
## [43] 0.361850 -0.344710 0.146530 0.076999 -1.275600 -0.631900 -0.635160
## [50] -0.517290 -0.901670</code></pre>
<p>We can now add this matrix to our initial df.</p>
<pre class="r"><code>df_train_glove &lt;- bind_cols(df_train, as_tibble(tweets_embedding, .name_repair = &quot;unique&quot;) %&gt;% clean_names()) %&gt;%
clean_names()
# and let&#39;s save all this hard work!
write_rds(x = df_train_glove, path = &quot;~/disaster_tweets/data/train_glove_25d.rds&quot;)
#write_rds(x = df_train_glove, path = &quot;data/train_glove_50d.rds&quot;)</code></pre>
<p>Before we go on and model, we still need to process our test data.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="https://nlp.stanford.edu/projects/glove/" class="uri">https://nlp.stanford.edu/projects/glove/</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p><a href="https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/" class="uri">https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/</a><a href="#fnref2" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>Disaster Tweets - Part II</title>
<link>/post/disaster-tweets-part-ii/</link>
<pubDate>Tue, 26 May 2020 00:00:00 +0000</pubDate>
<guid>/post/disaster-tweets-part-ii/</guid>
<description>
<p>In the second part of this NLP task, we will use Singular Value Decomposition to help us transform a sparse matrix (from the Document Term Matrix - dtm) into a dense matrix. Hence this is still very much a BOW approach. This approach combined with xgboost gave us the best results without using word-embedding (or word-vectors) techniques. That said, we are not sure how this approach would work in production as it seems we would have to constantly regenerate the dense matrix (which is quite computationally intense). We would love to see / hear from others on how to use svd in this type of task.</p>
<p>In a sense, SVD can be seen as a dimensionality reduction technique: going from a very wide sparse matrix (as many columns as there are distinct words across all the tweets) to a much narrower dense one.</p>
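<p>To make the idea concrete, here is a minimal sketch on a made-up document-term matrix. The matrix <code>toy_dtm</code> and the choice <code>k = 2</code> are assumptions for illustration only, and the sketch uses base R’s <code>svd()</code> rather than irlba so that it runs without extra packages.</p>
<pre class="r"><code># toy document-term matrix: 4 documents (rows) x 5 terms (columns)
toy_dtm &lt;- matrix(c(1, 1, 0, 0, 0,
                    1, 0, 1, 0, 0,
                    0, 0, 0, 1, 1,
                    0, 0, 1, 1, 1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0(&quot;doc&quot;, 1:4),
                                  c(&quot;fire&quot;, &quot;crash&quot;, &quot;flood&quot;, &quot;music&quot;, &quot;love&quot;)))
k &lt;- 2 # number of dimensions to keep (hypothetical choice)
toy_svd &lt;- svd(toy_dtm, nu = k, nv = k)
# dense representation: one row per document, k columns instead of 5
toy_dense &lt;- toy_svd$u %*% diag(toy_svd$d[1:k])
dim(toy_dense)
# rank-k reconstruction of the original matrix and its Frobenius error
toy_approx &lt;- toy_dense %*% t(toy_svd$v)
norm(toy_dtm - toy_approx, type = &quot;F&quot;)</code></pre>
<p>Each document is now described by <code>k</code> dense columns instead of one column per word. The reconstruction error is driven entirely by the singular values that were dropped, which is why keeping more of them preserves more of the original matrix.</p>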
<p>So let’s first build that sparse matrix: the rows are the documents (in this case the tweet IDs) and the columns are the words (one word per column).</p>
<p>Because the dimensionality reduction is based on the words, we need to use the whole dataset (training and test sets together) for this task. Of course, this is not really practical when genuinely new tweets come in, since they were not part of the decomposition.</p>
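<p>One common way around recomputing everything, often called “folding-in”, is to project a new document onto the basis that was already computed. The sketch below only illustrates the algebra on a made-up terms-by-documents matrix (same orientation as the transposed matrix used later in this post); <code>toy_tdm</code>, <code>fold_in()</code> and <code>new_doc</code> are hypothetical names, and we have not tested whether this holds up on the competition data.</p>
<pre class="r"><code># toy terms x documents matrix: 5 terms (rows) x 4 documents (columns)
toy_tdm &lt;- matrix(c(1, 1, 0, 0,
                    1, 0, 0, 0,
                    0, 1, 0, 1,
                    0, 0, 1, 1,
                    0, 0, 1, 1),
                  nrow = 5, byrow = TRUE)
k &lt;- 2
s &lt;- svd(toy_tdm, nu = k, nv = k)
# project a document (a vector of term weights) onto the existing basis:
# since A = U D V&#39;, each row of V equals (document column)&#39; U D^-1
fold_in &lt;- function(doc_vec, s, k) {
  as.numeric(t(doc_vec) %*% s$u %*% diag(1 / s$d[1:k]))
}
# sanity check: an existing document lands on its own row of v
all.equal(fold_in(toy_tdm[, 1], s, k), s$v[1, ])
# a hypothetical new document that uses terms 1 and 3
new_doc &lt;- c(1, 0, 1, 0, 0)
fold_in(new_doc, s, k)</code></pre>
<p>Two caveats with this sketch: the new document has to be weighted with the idf values learned on the original corpus, and any word that was not in the original vocabulary is simply ignored, so the projection is likely to get worse as the vocabulary drifts.</p>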
<p>Also, since we have already developed a whole cleaning workflow, let’s re-use it on the whole df.</p>
<div id="setting-up" class="section level1">
<h1>Setting up</h1>
<pre class="r"><code>library(readr) # to read and write (import / export) any type into our R console.
library(dplyr) # for pretty much all our data wrangling
library(ggplot2)
library(stringr)
library(forcats)
library(purrr)
library(kableExtra)
library(rsample) # to use initial_split() and some other resampling techniques later on.
library(recipes) # to use the recipe() and step_() functions
library(parsnip) # the main engine that run the models
library(workflows) # to use workflow()
library(tune) # to fine tune the hyperparameters with tune() and tune_grid()
library(dials) # to use grid_regular() and penalty()
library(yardstick) # to create the measure of accuracy, f1 score and ROC-AUC
library(doParallel) #to parallelize the work - useful in tune()
library(tidytext)
library(textrecipes)</code></pre>
<p>We’ll be reusing the same clean_tweets() function we used in Part I to clean the tweets. We just copy-paste it here and repurpose it.</p>
<pre class="r"><code>df_train &lt;- read_csv(&quot;~/disaster_tweets/data/train.csv&quot;) %&gt;% as_tibble() %&gt;% select(id, text, keyword, location)
df_test &lt;- read_csv(&quot;~/disaster_tweets/data/test.csv&quot;) %&gt;% as_tibble() %&gt;% select(id, text, keyword, location)
df_all &lt;- bind_rows(df_train, df_test)
clean_tweets &lt;- function(df){
df &lt;- df %&gt;%
mutate(number_hashtag = str_count(string = text, pattern = &quot;#&quot;),
number_number = str_count(string = text, pattern = &quot;[0-9]&quot;) %&gt;% as.numeric(),
number_http = str_count(string = text, pattern = &quot;http&quot;) %&gt;% as.numeric(),
number_mention = str_count(string = text, pattern = &quot;@&quot;) %&gt;% as.numeric(),
number_location = if_else(!is.na(location), 1, 0),
number_keyword = if_else(!is.na(keyword), 1, 0),
number_repeated_char = str_count(string = text, pattern = &quot;([a-z])\\1{2}&quot;) %&gt;% as.numeric(),
text = str_replace_all(string = text, pattern = &quot;http[^[:space:]]*&quot;, replacement = &quot;&quot;),
text = str_replace_all(string = text, pattern = &quot;@[^[:space:]]*&quot;, replacement = &quot;&quot;),
number_char = nchar(text), #add the length of the tweet in character.
number_word = str_count(string = text, pattern = &quot;\\w+&quot;),
text = str_replace_all(string = text, pattern = &quot;[0-9]&quot;, replacement = &quot;&quot;),
text = map(text, textstem::lemmatize_strings) %&gt;% unlist(.),
text = map(text, function(.x) stringi::stri_trans_general(.x, &quot;Latin-ASCII&quot;)) %&gt;% unlist(.),
text = str_replace_all(string = text, pattern = &quot;\u0089&quot;, replacement = &quot;&quot;)) %&gt;%
select(-keyword, -location)
return(df)
}
df_all &lt;- clean_tweets(df_all)</code></pre>
</div>
<div id="finding-the-svd-matrix" class="section level1">
<h1>Finding the SVD matrix</h1>
<p>Let’s now work on our sparse matrix with the bind_tf_idf() function. First, we’ll need to tokenize the tweets and remove stop-words. To be able to compute the tf_idf, we’ll also need to count the occurrences of each word in each tweet.</p>
<pre class="r"><code>df_all_tok &lt;- df_all %&gt;%
unnest_tokens(word, text) %&gt;% anti_join(stop_words %&gt;% filter(lexicon == &quot;snowball&quot;)) %&gt;%
mutate(word_stem = textstem::stem_words(word)) %&gt;% count(id, word_stem)
df_all_tf_idf &lt;- df_all_tok %&gt;% bind_tf_idf(term = word_stem, document = id, n = n)
# turning the tf_idf into a matrix.
dtm_df_all &lt;- cast_dtm(term = word_stem, document = id, value = tf_idf, data = df_all_tf_idf)
mat_df_all &lt;- as.matrix(dtm_df_all)
dim(mat_df_all)</code></pre>
<pre><code>## [1] 10873 13802</code></pre>
<pre class="r"><code>length(unique(df_all$id)) </code></pre>
<pre><code>## [1] 10876</code></pre>
<pre class="r"><code># I have a problem! Some tweets have not made it into our matrix.
# That&#39;s probably because they contained just a link, just a number or just stop words.
# This is also why I have changed the corpus of stop-words.
# So 3 tweets have not made it at all if we consider both the training and testing sets. </code></pre>
<p>Let’s have a look at our sparse matrix to better understand what’s going on.</p>
<pre class="r"><code>mat_df_all[1:10, 1:20]</code></pre>
<pre><code>## Terms
## Docs car crash happen just terribl allah deed
## 0 0.8580183 0.7837519 0.9588457 0.6432791 1.330996 0.0000000 0.000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.9851632 1.228699
## 2 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 6 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 7 0.0000000 0.0000000 0.0000000 0.3216396 0.000000 0.0000000 0.000000
## 8 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## 9 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
## Terms
## Docs earthquak forgiv mai reason u citi differ
## 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 1 0.7270493 0.9851632 0.6197444 0.7839108 0.4343156 0.0000000 0.000000
## 2 0.7270493 0.0000000 0.0000000 0.0000000 0.0000000 0.6864859 0.873712
## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 6 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 7 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 8 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## 9 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## Terms
## Docs everyon hear safe stai across fire
## 0 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## 2 0.7207918 0.6628682 0.873712 0.7776986 0.0000000 0.0000000
## 3 0.0000000 0.0000000 0.000000 0.0000000 0.6664668 0.3448581
## 4 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.4433889
## 5 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## 6 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## 7 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## 8 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.2586435
## 9 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000</code></pre>
<p>The values in the matrix are not the word frequencies but their tf_idf.</p>
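<p>As a reminder of what these numbers mean, the sketch below computes tf-idf by hand in base R, using the usual definition (term frequency within the document multiplied by the natural log of the inverse document frequency), which is, to our understanding, what bind_tf_idf() computes. The data frame <code>toy_counts</code> is made up for illustration.</p>
<pre class="r"><code># toy word counts: one row per (document, word) pair
toy_counts &lt;- data.frame(
  id = c(1, 1, 2, 2, 3),
  word = c(&quot;fire&quot;, &quot;car&quot;, &quot;fire&quot;, &quot;flood&quot;, &quot;music&quot;),
  n = c(2, 1, 1, 1, 3),
  stringsAsFactors = FALSE
)
# term frequency: count of the word divided by the total count in its document
doc_total &lt;- tapply(toy_counts$n, toy_counts$id, sum)
toy_counts$tf &lt;- toy_counts$n / as.numeric(doc_total[as.character(toy_counts$id)])
# inverse document frequency: log(number of documents / documents containing the word)
n_docs &lt;- length(unique(toy_counts$id))
doc_freq &lt;- table(toy_counts$word)
toy_counts$idf &lt;- log(n_docs / as.numeric(doc_freq[toy_counts$word]))
toy_counts$tf_idf &lt;- toy_counts$tf * toy_counts$idf
toy_counts</code></pre>
<p>A word such as “fire”, which appears in two of the three toy documents, gets a lower idf than “music”, which appears in only one. That is how tf-idf down-weights words that are common across the corpus.</p>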
<p>Let’s now fix the issue of the missing tweets, or we will run into problems later on during the modeling workflow. We see that the matrix is ordered by ID.</p>
<pre class="r"><code># Let&#39;s identify which tweets didn&#39;t make it into our df3 and save them.
df_mat_rowname &lt;- tibble(id = as.numeric(rownames(mat_df_all)))
df_rowname &lt;- tibble(id = df_all$id)
missing_id &lt;- df_rowname %&gt;% anti_join(df_mat_rowname)
# Let&#39;s add empty rows with the right id as rowname to our matrix.
yo &lt;- matrix(0.0, nrow = nrow(missing_id), ncol = ncol(mat_df_all))
rownames(yo) &lt;- missing_id$id
mat_df &lt;- rbind(mat_df_all, yo)
dim(mat_df)</code></pre>
<pre><code>## [1] 10876 13802</code></pre>
<pre class="r"><code>#mat_df3[7601:7613, 11290:11302]
### trying to keep track of the order of the matrix
mat_df_id &lt;- rownames(mat_df)
head(mat_df_id, 20)</code></pre>
<pre><code>## [1] &quot;0&quot; &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; &quot;5&quot; &quot;6&quot; &quot;7&quot; &quot;8&quot; &quot;9&quot; &quot;10&quot; &quot;11&quot; &quot;12&quot; &quot;13&quot; &quot;14&quot;
## [16] &quot;15&quot; &quot;16&quot; &quot;17&quot; &quot;18&quot; &quot;19&quot;</code></pre>
<pre class="r"><code>tail(mat_df_id, 20)</code></pre>
<pre><code>## [1] &quot;10859&quot; &quot;10860&quot; &quot;10861&quot; &quot;10862&quot; &quot;10863&quot; &quot;10864&quot; &quot;10865&quot; &quot;10866&quot; &quot;10867&quot;
## [10] &quot;10868&quot; &quot;10869&quot; &quot;10870&quot; &quot;10871&quot; &quot;10872&quot; &quot;10873&quot; &quot;10874&quot; &quot;10875&quot; &quot;6394&quot;
## [19] &quot;9697&quot; &quot;43&quot;</code></pre>
<p>Now that we have solved the issue of missing rows (which took almost a whole day to figure out), we can move on to finding the dense matrix. We will use the <strong>irlba</strong> library to help with the decomposition.</p>
<pre class="r"><code>incomplete.cases &lt;- which(!complete.cases(mat_df))
mat_df[incomplete.cases,] &lt;- rep(0.0, ncol(mat_df))
dim(mat_df) </code></pre>
<pre><code>## [1] 10876 13802</code></pre>
<pre class="r"><code>svd_mat &lt;- irlba::irlba(t(mat_df), nv = 750, maxit = 2000)
write_rds(x = svd_mat, path = &quot;~/disaster_tweets/data/svd.rds&quot;)
# And then save the whole df with ID + svd
svd_mat &lt;- read_rds(&quot;~/disaster_tweets/data/svd.rds&quot;)
yo &lt;- as_tibble(svd_mat$v)
dim(yo)</code></pre>
<pre><code>## [1] 10876 750</code></pre>
<pre class="r"><code>df4 &lt;- bind_cols(id = as.numeric(mat_df_id), yo)
write_rds(x = df4, path = &quot;~/disaster_tweets/data/svd_df_all750.rds&quot;)</code></pre>
<p>It is worth mentioning that the singular value decomposition didn’t parallelize on my machine and it took a bit over 3 hours to get the matrix. That’s why we have saved it for further use.
[When I used irlba on our university computer (84 cores, over 750 GB of RAM), it parallelized very nicely on all cores and didn’t take more than 5 minutes.]</p>
<p>Now that we have our dense matrix, we can start to fit back all the pieces together for our modelling process.</p>
<pre class="r"><code>df_train &lt;- read_csv(&quot;~/disaster_tweets/data/train.csv&quot;) %&gt;% clean_tweets()
# sorting out the same tweets, different target issues
temp &lt;- df_train %&gt;% group_by(text) %&gt;%
mutate(mean_target = mean(target),
new_target = if_else(mean_target &gt; 0.5, 1, 0)) %&gt;% ungroup() %&gt;%
mutate(target = new_target,
target_bin = factor(if_else(target == 1, &quot;a_truth&quot;, &quot;b_false&quot;))) %&gt;%
select(-new_target, -mean_target, -target)
df_svd &lt;- read_rds(&quot;~/disaster_tweets/data/svd_df_all750.rds&quot;)
df_train &lt;- left_join(temp, df_svd, by = &quot;id&quot;) %&gt;%
select(-text)</code></pre>
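<p>The “same tweets, different target” step above can be easier to follow on a tiny example. The sketch below reproduces the same majority rule in base R on a made-up data frame (<code>toy</code> is an assumption for illustration): identical texts all receive the label held by the majority of their copies.</p>
<pre class="r"><code># toy data: text &quot;a&quot; appears 3 times and text &quot;b&quot; twice, with conflicting labels
toy &lt;- data.frame(text = c(&quot;a&quot;, &quot;a&quot;, &quot;a&quot;, &quot;b&quot;, &quot;b&quot;, &quot;c&quot;),
                  target = c(1, 1, 0, 1, 0, 0),
                  stringsAsFactors = FALSE)
# mean target within each group of identical texts (ave() uses mean by default)
toy$mean_target &lt;- ave(toy$target, toy$text)
# majority rule: strictly more than half of the copies must be labelled 1
toy$new_target &lt;- ifelse(toy$mean_target &gt; 0.5, 1, 0)
toy</code></pre>
<p>Note that with the strict <code>&gt; 0.5</code> threshold, an exact tie (text “b” here) is resolved as 0, that is, as “not a real disaster”.</p>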
</div>
<div id="svd-with-lasso" class="section level1">
<h1>SVD with Lasso</h1>
<pre class="r"><code>set.seed(0109)
rsplit_df &lt;- initial_split(df_train, strata = target_bin, prop = 0.85)
df_train_tr &lt;- training(rsplit_df)
df_train_te &lt;- testing(rsplit_df)
# reusing the same df_train, df_train_tr, df_train_te from before.
recipe_tweet &lt;- recipe(formula = target_bin ~ ., data = df_train_tr) %&gt;%
update_role(id, new_role = &quot;ID&quot;) %&gt;%
step_zv(all_numeric(), -all_outcomes()) %&gt;%
step_normalize(all_numeric())
# we&#39;ll assign 45 different values for our penalty.
# we noticed earlier that the best values are between penalties 0.001 and 0.005
grid_lambda &lt;- expand.grid(penalty = seq(0.0014,0.005, length = 45))
# This time we&#39;ll use 10 folds cross-validation
set.seed(0109)
folds_training &lt;- vfold_cv(df_train, v = 10, repeats = 1)
model_lasso &lt;- logistic_reg(mode = &quot;classification&quot;,
penalty = tune(), mixture = 1) %&gt;%
set_engine(&quot;glmnet&quot;)
# starting our workflow
wf_lasso &lt;- workflow() %&gt;%
add_recipe(recipe_tweet) %&gt;%
add_model(model_lasso)
library(doParallel)
registerDoParallel(cores = 64)
# run a lasso regression with cross-validation, on 45 different levels of penalty
tune_lasso &lt;- tune_grid(
wf_lasso,
resamples = folds_training,
grid = grid_lambda,
metrics = metric_set(roc_auc, f_meas, accuracy),
control = control_grid(verbose = TRUE)
)
tune_lasso %&gt;% collect_metrics() %&gt;%
write_csv(&quot;~/disaster_tweets/data/metrics_lasso_svd750.csv&quot;)
best_metric &lt;- tune_lasso %&gt;% select_best(&quot;f_meas&quot;)
wf_lasso &lt;- finalize_workflow(wf_lasso, best_metric)
last_fit(wf_lasso, rsplit_df) %&gt;% collect_metrics()</code></pre>
<pre><code>## # A tibble: 2 x 3
## .metric .estimator .estimate
## &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
## 1 accuracy binary 0.798
## 2 roc_auc binary 0.860</code></pre>
<pre class="r"><code>#save the final lasso model
model_lasso_svd &lt;- fit(wf_lasso, df_train)
write_rds(x = model_lasso_svd, path = &quot;~/disaster_tweets/data/model_lasso_svd750.rds&quot;) </code></pre>
<p>Note 1 (Lasso): svd with 1000L, normalize all, penalty 0.001681; scores: f1 = 73.99, accuracy = 79.3, roc = 85.4</p>
<div id="analysis-of-grid-results" class="section level2">
<h2>Analysis of grid results</h2>
<pre class="r"><code># we read the results of our sample to see the penalty values and their performances.
metrics &lt;- read_csv(&quot;~/disaster_tweets/data/metrics_lasso_svd750.csv&quot;)
metrics %&gt;%
ggplot(aes(x = penalty, y = mean, color = .metric)) +
geom_line() +
facet_wrap(~.metric) +
scale_x_log10()</code></pre>
<p><img src="/post/disaster-tweets-II/index_files/figure-html/grid-lasso-1.png" width="672" /></p>
</div>
<div id="make-predictions" class="section level2">
<h2>Make predictions</h2>
<pre class="r"><code>df_test &lt;- read_csv(&quot;~/disaster_tweets/data/test.csv&quot;) %&gt;% clean_tweets()
df_svd &lt;- read_rds(&quot;~/disaster_tweets/data/svd_df_all750.rds&quot;)
df_test &lt;- left_join(df_test, df_svd, by = &quot;id&quot;)
library(glmnet)
prediction_lasso_svd &lt;- tibble(id = df_test$id,
# predict() on a workflow returns a tibble; the predicted classes are in .pred_class
target = if_else(predict(model_lasso_svd, new_data = df_test)$.pred_class == &quot;a_truth&quot;, 1, 0))
prediction_lasso_svd %&gt;% write_csv(path = &quot;~/disaster_tweets/data/prediction_svd_lasso750.csv&quot;)
# clean everything
rm(list = ls())</code></pre>
<p>On the training set with cross-validation, this model, with a penalty of 0.001681, gave us f1 = 73.99, accuracy = 79.3, roc = 85.4. On Kaggle, this model gave us a public score of 76.79. This is not really good considering we got much better results earlier with our <a href="https://fderyckel.github.io/post/disaster-tweets-part-i/#baseline-with-some-additional-features">enhanced approach</a>.</p>
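<p>For readers less familiar with these metrics, here is a small sketch of how accuracy and the f1 score are derived from predictions, in base R and on made-up labels (<code>truth</code> and <code>pred</code> are assumptions for illustration). By default yardstick treats the first factor level as the event of interest, which is why the levels in this post are named so that “a_truth” sorts before “b_false”.</p>
<pre class="r"><code>lv &lt;- c(&quot;a_truth&quot;, &quot;b_false&quot;)
truth &lt;- factor(c(&quot;a_truth&quot;, &quot;a_truth&quot;, &quot;a_truth&quot;, &quot;b_false&quot;,
                  &quot;b_false&quot;, &quot;b_false&quot;, &quot;b_false&quot;, &quot;a_truth&quot;), levels = lv)
pred &lt;- factor(c(&quot;a_truth&quot;, &quot;a_truth&quot;, &quot;b_false&quot;, &quot;b_false&quot;,
                 &quot;b_false&quot;, &quot;a_truth&quot;, &quot;b_false&quot;, &quot;a_truth&quot;), levels = lv)
# confusion matrix cells, with &quot;a_truth&quot; as the positive class
tp &lt;- sum(truth == &quot;a_truth&quot; &amp; pred == &quot;a_truth&quot;)
fp &lt;- sum(truth == &quot;b_false&quot; &amp; pred == &quot;a_truth&quot;)
fn &lt;- sum(truth == &quot;a_truth&quot; &amp; pred == &quot;b_false&quot;)
tn &lt;- sum(truth == &quot;b_false&quot; &amp; pred == &quot;b_false&quot;)
precision &lt;- tp / (tp + fp)
recall &lt;- tp / (tp + fn)
f1 &lt;- 2 * precision * recall / (precision + recall)
accuracy &lt;- (tp + tn) / length(truth)
c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy)</code></pre>
<p>Unlike accuracy, f1 ignores the true negatives, so it is the more demanding metric when the “real disaster” class is the one we care about. This is also the metric used above to select the best penalty.</p>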
</div>
</div>
<div id="svd-with-xgboost" class="section level1">
<h1>SVD with Xgboost</h1>
<p>We can use the same idea with xgboost.</p>
<pre class="r"><code>clean_tweets &lt;- function(df){
df &lt;- df %&gt;%
mutate(number_hashtag = str_count(string = text, pattern = &quot;#&quot;),
number_number = str_count(string = text, pattern = &quot;[0-9]&quot;) %&gt;% as.numeric(),
number_http = str_count(string = text, pattern = &quot;http&quot;) %&gt;% as.numeric(),
number_mention = str_count(string = text, pattern = &quot;@&quot;) %&gt;% as.numeric(),
number_location = if_else(!is.na(location), 1, 0),
number_keyword = if_else(!is.na(keyword), 1, 0),
number_repeated_char = str_count(string = text, pattern = &quot;([a-z])\\1{2}&quot;) %&gt;% as.numeric(),
text = str_replace_all(string = text, pattern = &quot;http[^[:space:]]*&quot;, replacement = &quot;&quot;),
text = str_replace_all(string = text, pattern = &quot;@[^[:space:]]*&quot;, replacement = &quot;&quot;),
number_char = nchar(text), #add the length of the tweet in character.
number_word = str_count(string = text, pattern = &quot;\\w+&quot;),
text = str_replace_all(string = text, pattern = &quot;[0-9]&quot;, replacement = &quot;&quot;),
text = map(text, textstem::lemmatize_strings) %&gt;% unlist(.),
text = map(text, function(.x) stringi::stri_trans_general(.x, &quot;Latin-ASCII&quot;)) %&gt;% unlist(.),
text = str_replace_all(string = text, pattern = &quot;\u0089&quot;, replacement = &quot;&quot;)) %&gt;%
select(-keyword, -location)
return(df)
}
df_train &lt;- read_csv(&quot;~/disaster_tweets/data/train.csv&quot;) %&gt;% clean_tweets()
# sorting out the same tweets, different target issues
temp &lt;- df_train %&gt;% group_by(text) %&gt;%
mutate(mean_target = mean(target),
new_target = if_else(mean_target &gt; 0.5, 1, 0)) %&gt;% ungroup() %&gt;%
mutate(target = new_target,
target_bin = factor(if_else(target == 1, &quot;a_truth&quot;, &quot;b_false&quot;))) %&gt;%
select(-new_target, -mean_target, -target)
df_svd &lt;- read_rds(&quot;~/disaster_tweets/data/svd_df_all750.rds&quot;)
df_train &lt;- left_join(temp, df_svd, by = &quot;id&quot;) %&gt;%
select(-text)
recipe_tweet &lt;- recipe(formula = target_bin ~ ., data = df_train) %&gt;%
update_role(id, new_role = &quot;ID&quot;)
# xgboost classification, tuning on trees, tree-depth and mtry
model_xgboost &lt;- boost_tree(mode = &quot;classification&quot;, trees = tune(),
learn_rate = 0.01, tree_depth = tune(), mtry = tune()) %&gt;%
set_engine(&quot;xgboost&quot;, nthread = 64)
# starting our workflow
wf_xgboost &lt;- workflow() %&gt;%
add_recipe(recipe_tweet) %&gt;%
add_model(model_xgboost)
# This time we use 5 folds cross-validation.
# xgboost is extremely resource intensive on wide df.
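set.seed(0109)
# rm(list = ls()) above removed rsplit_df, so we recreate the same 85/15 split for last_fit()
rsplit_df &lt;- initial_split(df_train, strata = target_bin, prop = 0.85)
set.seed(0109)
folds_training &lt;- vfold_cv(df_train, v = 5, repeats = 1)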
grid_xgboost &lt;- expand.grid(trees = c(2000),
tree_depth = c(5, 6),
mtry = c(150, 300))
library(doParallel)
registerDoParallel(cores = 64)
# run a xgboost classification with cross-validation
tune_xgboost &lt;- tune_grid(
wf_xgboost,
resamples = folds_training,
grid = grid_xgboost,
metrics = metric_set(roc_auc, f_meas, accuracy),
control = control_grid(verbose = TRUE, save_pred = TRUE)
)
tune_xgboost %&gt;% collect_metrics() %&gt;%
write_csv(&quot;~/disaster_tweets/data/metrics_xgboost_svd750.csv&quot;)
best_metric &lt;- tune_xgboost %&gt;% select_best(&quot;f_meas&quot;)
wf_xgboost &lt;- finalize_workflow(wf_xgboost, best_metric)
last_fit(wf_xgboost, rsplit_df) %&gt;% collect_metrics()</code></pre>
<pre><code>## # A tibble: 2 x 3
## .metric .estimator .estimate
## &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
## 1 accuracy binary 0.825
## 2 roc_auc binary 0.883</code></pre>
<pre class="r"><code>#save the final xgboost model
model_xgboost_svd &lt;- fit(wf_xgboost, df_train)
write_rds(x = model_xgboost_svd, path = &quot;~/disaster_tweets/data/model_xgboost_svd750.rds&quot;) </code></pre>
<p>Using xgboost in combination with svd gives much better results. Here are a few things that we have tried with our training data:</p>
<ul>
<li>svd 1000 wide matrix and xgboost with 150 mtry, 2500 trees, 5 tree-depth, gave us f1 = 74.77, accuracy = 80.90, roc = 86.45</li>
<li>svd 750 wide matrix and xgboost with 150 mtry, 2000 trees, 6 tree-depth, gave us f1 = 74.99, accuracy = 81.05, roc = 87</li>
<li>svd 500 wide matrix and xgboost with 200 mtry, 2000 trees, 6 tree-depth, gave us f1 = 75.11, accuracy = 81.02, roc = 86.87</li>
<li>svd 250 wide matrix and xgboost with 125 mtry, 1500 trees, 5 tree-depth, gave us f1 = 74.93, accuracy = 80.81, roc = 86.62</li>
</ul>
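<p>To give a sense of why we kept the xgboost grid so small, the sketch below counts the number of model fits implied by the grid and the 5-fold cross-validation used in this section (the variable names <code>n_folds</code> and <code>n_fits</code> are ours, for illustration).</p>
<pre class="r"><code># same grid as in the tuning code of this section
grid_xgboost &lt;- expand.grid(trees = c(2000),
                            tree_depth = c(5, 6),
                            mtry = c(150, 300))
n_folds &lt;- 5
# every combination of hyper-parameters is fitted once per fold
n_fits &lt;- nrow(grid_xgboost) * n_folds
n_fits</code></pre>
<p>Each of these fits grows 2000 trees on several hundred columns, so adding even one more value to any of the tuned parameters quickly becomes expensive.</p>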
<div id="variable-importance" class="section level2">
<h2>variable importance</h2>
<pre class="r"><code>library(vip)
model_xgboost_svd %&gt;%
pull_workflow_fit() %&gt;%
vip::vip(geom = &quot;point&quot;, num_features = 20)</code></pre>
<p><img src="/post/disaster-tweets-II/index_files/figure-html/vip-1.png" width="672" /></p>
<p>Clearly, we can no longer interpret our variables, as they are the result of the singular value decomposition of a tf-idf sparse matrix. However, we are happy to see that our extra variables have played a role in determining whether a tweet was about a real disaster or not.</p>
</div>
</div>
<div id="submission-of-results" class="section level1">
<h1>Submission of results</h1>
<pre class="r"><code>df_test &lt;- read_csv(&quot;~/disaster_tweets/data/test.csv&quot;) %&gt;% clean_tweets()
df_svd &lt;- read_rds(&quot;~/disaster_tweets/data/svd_df_all750.rds&quot;)
df_test &lt;- left_join(df_test, df_svd, by = &quot;id&quot;)
library(xgboost)
prediction_xgboost_svd &lt;- tibble(id = df_test$id,
# predict() on a workflow returns a tibble; the predicted classes are in .pred_class
target = if_else(predict(model_xgboost_svd, new_data = df_test)$.pred_class == &quot;a_truth&quot;, 1, 0))
prediction_xgboost_svd %&gt;% write_csv(path = &quot;~/disaster_tweets/data/prediction_svd_xgboost750.csv&quot;)</code></pre>
<p>Note 1: majority voting, svd with 850 wide, using lasso, got 77% public score.</p>
<p>Note 2: majority voting, svd 500 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got a 80.01 public score.</p>
<p>Note 3: majority voting, svd with 750 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got 81.29% public score. Yeahhh!!!!!!!</p>
<p>Here is a screenshot of our results:<br />
<img src="/img/screenshot-results.png" alt="screenshot of results" /></p>
</div>
<div id="references" class="section level1">
<h1>References</h1>
<ul>
<li>To help with the use of irlba and <a href="https://www.kaggle.com/barun2104/nlp-with-disaster-eda-dfm-svd-ensemble">check for the complete matrix</a></li>
</ul>
</div>
</description>
</item>
<item>
<title>Disaster Tweets - Part I</title>
<link>/post/disaster-tweets-part-i/</link>
<pubDate>Mon, 25 May 2020 00:00:00 +0000</pubDate>
<guid>/post/disaster-tweets-part-i/</guid>
<description>
<script src="/rmarkdown-libs/kePrint/kePrint.js"></script>
<div id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#baseline-model---lasso-model-on-just-text">Baseline model - Lasso model on just text</a><ul>
<li><a href="#creating-a-model-workflow">Creating a model workflow</a></li>
<li><a href="#analysis-of-results">Analysis of results</a></li>
<li><a href="#picking-the-best-model">Picking the best model</a></li>
<li><a href="#variable-importance">variable importance</a></li>
<li><a href="#submission-of-results">Submission of results</a></li>
</ul></li>
<li><a href="#baseline-with-some-additional-features">Baseline with some additional features</a><ul>
<li><a href="#rebuilding-the-data-frame-and-variables">Rebuilding the data frame and variables</a></li>
<li><a href="#creating-and-tuning-a-model">Creating and tuning a model</a></li>
<li><a href="#variable-importances">Variable importances</a></li>
<li><a href="#submission-of-results-1">Submission of results</a></li>
</ul></li>
<li><a href="#wonderings-and-lessons-learned.">Wonderings and lessons learned.</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p><em>Real or Not? NLP with Disaster Tweets</em> Predict which Tweets are about real disasters and which ones are not.<br />
The task comes from a <a href="https://www.kaggle.com/c/nlp-getting-started">Kaggle competition</a> which is to detect if a tweet about an emergency disaster is real. Hence, this is an NLP classification problem.</p>
<p>It is fairly easy for a human to see whether a tweet is about a real disaster or not, but it is harder for a machine to detect it. Take, for instance, the tweet <em>“look at the sky last night, it was ABLAZE”</em>. Although it contains a disaster keyword like “ablaze”, the word in this context wasn’t meant to refer to an emergency disaster. This task is seen as <em>“a getting started”</em> problem by Kaggle.</p>
<p>As I have been a volunteer firefighter in my local community for the last 3 years, this Kaggle task struck a chord with me. And yes, that is me in the picture. Imagine this heavy, well insulated PPE, a super intense physical challenge and then the Saudi heat with the humidity of the Red Sea ;-)</p>
<p>I am planning a three-part post.</p>
<ul>
<li>The first part is very much a BOW (bag of words) approach using Lasso.</li>
<li>The second part is still a BOW approach, this time using SVD and modelling with Lasso and Xgboost.</li>
<li>The third part is word embedding using Glove. (Still trying to make it work with Bert pre-trained models. Maybe I’ll have that sorted out by the end.)</li>
</ul>
<p>Throughout these posts, I will use packages from 3 main sets: the <a href="https://www.tidyverse.org/">tidyverse</a> for data wrangling, the <a href="https://www.tidymodels.org/">tidymodels</a> for modelling and the <a href="https://www.tidytextmining.com/">tidytext</a> for dealing with text data. These sets of packages make a coherent whole and, in my opinion, make it easier to learn the data analysis &amp; modelling workflow. They are, of course, not the only option. There are many other alternatives in R.</p>
<p>Loading the libraries first.</p>
<pre class="r"><code>library(readr) # to read and write (import / export) any type into our R console.
library(dplyr) # for pretty much all our data wrangling
library(stringr) # to deal with strings: str_remove() and many other regex functions. this is a NLP task, so lots of it ;-)
library(purrr) # to map functions over rows
library(forcats) # to deal with categorical variables: the fct_reorder() function
library(ggplot2) # to plot
library(kableExtra) # for making pretty table on html
library(rsample) # to split df with initial_split()
# to use resampling techniques with bootstraps() and vfold_cv()
library(parsnip) # the main engine that run the models
library(recipes) # to use the recipe() functions
library(textrecipes) # to use the step_tokenize() and step_tfidf()
library(workflows) # to use workflow()
library(tune) # to fine tune the hyper-parameters using tune() and tune_grid()
library(dials) # to create grid of parameters using grid_regular() and penalty()
library(yardstick) # to create the measure of accuracy, f1 score and ROC-AUC
library(glmnet) # to use lasso, it is loaded automatically when calling set_engine()
# but it isn&#39;t loaded later on when using predict()
library(vip) # tidy framework to check variables importance</code></pre>
<p>Without further ado, let’s get started by loading our training set and checking its structure.</p>
<pre class="r"><code># loading our training data
df_train &lt;- read_csv(&quot;~/disaster_tweets/data/train.csv&quot;) %&gt;% as_tibble()
# let&#39;s have a look at it
skimr::skim(df_train)</code></pre>
<table style='width: auto;'
class='table table-condensed'>
<caption>
<span id="tab:loading-data">Table 1: </span>Data summary
</caption>
<thead>
<tr>
<th style="text-align:left;">
</th>
<th style="text-align:left;">
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
Name
</td>
<td style="text-align:left;">
df_train
</td>
</tr>
<tr>
<td style="text-align:left;">
Number of rows
</td>
<td style="text-align:left;">
7613
</td>
</tr>
<tr>
<td style="text-align:left;">
Number of columns
</td>
<td style="text-align:left;">
5
</td>
</tr>
<tr>
<td style="text-align:left;">
_______________________
</td>
<td style="text-align:left;">
</td>
</tr>
<tr>
<td style="text-align:left;">
Column type frequency:
</td>
<td style="text-align:left;">
</td>
</tr>
<tr>
<td style="text-align:left;">
character
</td>
<td style="text-align:left;">
3
</td>
</tr>
<tr>
<td style="text-align:left;">
numeric
</td>
<td style="text-align:left;">
2
</td>
</tr>
<tr>
<td style="text-align:left;">
________________________
</td>
<td style="text-align:left;">
</td>
</tr>
<tr>
<td style="text-align:left;">
Group variables
</td>
<td style="text-align:left;">
None
</td>
</tr>
</tbody>
</table>
<p><strong>Variable type: character</strong></p>
<table>
<thead>
<tr>
<th style="text-align:left;">
skim_variable
</th>
<th style="text-align:right;">
n_missing
</th>
<th style="text-align:right;">
complete_rate
</th>
<th style="text-align:right;">
min
</th>
<th style="text-align:right;">
max
</th>
<th style="text-align:right;">
empty
</th>
<th style="text-align:right;">
n_unique
</th>
<th style="text-align:right;">
whitespace
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
keyword
</td>
<td style="text-align:right;">
61
</td>
<td style="text-align:right;">
0.99
</td>
<td style="text-align:right;">
4
</td>
<td style="text-align:right;">
21
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
221
</td>
<td style="text-align:right;">
0
</td>
</tr>
<tr>
<td style="text-align:left;">
location
</td>
<td style="text-align:right;">
2534
</td>
<td style="text-align:right;">
0.67
</td>
<td style="text-align:right;">
1
</td>
<td style="text-align:right;">
49
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
3279
</td>
<td style="text-align:right;">
0
</td>
</tr>
<tr>
<td style="text-align:left;">
text
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
1.00
</td>
<td style="text-align:right;">
7
</td>
<td style="text-align:right;">
157
</td>
<td style="text-align:right;">
0
</td>
<td style="text-align:right;">
7503
</td>
<td style="text-align:right;">
0
</td>
</tr>
</tbody>
</table>
<p><strong>Variable type: numeric</strong></p>
<table>
<thead>
<tr>
<th style="text-align:left;">
skim_variable
</th>
<th style="text-align:right;">
n_missing
</th>
<th style="text-align:right;">
complete_rate
</th>
<th style="text-align:right;">
mean
</th>
<th style="text-align:right;">
sd
</th>
<th style="text-align:right;">
p0
</th>
<th style="text-align:right;">
p25
</th>
<th style="text-align:right;">
p50
</th>
<th style="text-align:right;">
p75
</th>
<th style="text-align:right;">
p100