---
title: "easySentimentAnalyseR"
subtitle: "Twitter_BoW"
author: "Jack"
date: "October 21, 2017"
output:
  html_document:
    fig_caption: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
*This is a demo for my side project* <a href="https://jackho327.github.io/easySentimentAnalyseR/" target="_top">easySentimentAnalyseR</a>
```{r message=FALSE, warning=FALSE, echo=FALSE, results='hide'}
# this simple sentiment AnalyseR is based on bag-of-words (BoW)
# load all necessary libraries
source("./load_libs.r")
# set up the connection to your Twitter dev account
source("./easySentimentAnalyseR_Setup_Twitter.r")
# load functions
source("./funcs.r")
# load potentially necessary dictionaries
source("./load_dictionaries.r")
```
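The sourced scripts are not shown here; for reference, the Twitter authentication step roughly amounts to a call like the one below. This is only a minimal sketch: the four credential variables are placeholders you must fill in from your own Twitter dev app, and the actual `easySentimentAnalyseR_Setup_Twitter.r` may differ.
```{r eval=FALSE}
# minimal sketch of the authentication step hidden in
# easySentimentAnalyseR_Setup_Twitter.r -- the four credentials are
# placeholders; get yours from your Twitter developer account
library(twitteR)
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```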
After getting authenticated, modify the values of `search_str` and `num_twts`; you can also set an actual start/end date if you want to. Based on my own testing, the earliest date can be set to about 10 or 11 days before your current date.
For example, I set them as below.
```{r message=FALSE, warning=FALSE, cache=TRUE}
search_str = "lakers"
num_twts = 500 # the number of tweets you want to retrieve
lang = "en" # check lang at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
# pay attention to the rate limiting
# see --> (https://developer.twitter.com/en/docs/basics/rate-limiting)
# see ?searchTwitter for other potential args
your_tweets <- searchTwitter(searchString = search_str, n = num_twts, lang = lang, since = "2017-10-20", until = "2017-10-21")
```
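`searchTwitter()` returns a list of status objects; before any cleaning, it can help to eyeball a couple of them via their `$getText()` method. A quick sanity check, not part of the original pipeline:
```{r eval=FALSE}
# quick sanity check, not part of the original pipeline
length(your_tweets)        # how many tweets actually came back
your_tweets[[1]]$getText() # raw text of the first status object
```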
After adjusting the cleaning process, I was able to generate the word cloud shown below.
```{r message=FALSE, warning=FALSE, echo=FALSE}
# fetch all the texts
tweets_list <- sapply(your_tweets, function(x){
  x$getText() %>%
    str_replace_all(pattern = "http(s)?.*", replacement = " ") %>% # clear URLs beforehand
    str_replace_all(pattern = "^RT|@([a-zA-Z0-9:]*\\s+){1}", replacement = "") # clear the RT prefix and @user mentions
})
# create a copy of your corpus
# the cleaning process should be decided based on your actual search condition
copy_tweets_corpus <- VCorpus(VectorSource(tweets_list)) %>%
  tm_map(str_to_lower) %>% # turn all characters to lower case
  tm_map(removeNumbers) %>% # remove numbers
  tm_map(removePunctuation) %>% # remove punctuation
  tm_map(function(x) {
    removeWords(x, c(stopwords, str_replace_all(str_to_lower(search_str), "^#", ""))) %>% # remove stop words and the search term
      str_replace_all(pattern = "[^a-zA-Z\\s'-]", replacement = "")
  }) %>%
  tm_map(PlainTextDocument)
# clean the corpus based on your actual search condition
tweets_corpus <- VCorpus(VectorSource(tweets_list)) %>% # generate a volatile corpus (you can try a "Permanent" or "Simple" one)
  tm_map(str_to_lower) %>% # turn all characters to lower case
  tm_map(removeNumbers) %>% # remove numbers
  tm_map(removePunctuation) %>% # remove punctuation
  tm_map(function(x) {
    removeWords(x, c(stopwords, str_replace_all(str_to_lower(search_str), "^#", ""))) %>% # remove stop words and the search term
      str_replace_all(pattern = "[^a-zA-Z\\s'-]", replacement = "")
  }) %>%
  tm_map(stemDocument) %>% # do stemming; it sometimes produces confusing results, so use it based on your actual requirements
  tm_map(function(x){
    tmp <- strsplit(x = x, split = " ")[[1]]
    tmp <- tmp[tmp != ""]
    paste(stemCompletion(tmp, dictionary = copy_tweets_corpus), collapse = ' ')
  }) %>%
  tm_map(stripWhitespace) %>% # clean extra white space
  tm_map(PlainTextDocument) # turn the corpus into plain text documents
```
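If you want to verify that the cleaning did what you expect, you can compare a single tweet before and after the tm pipeline. A quick check, assuming at least one tweet came back:
```{r eval=FALSE}
# before/after check of the cleaning (assumes at least one tweet)
tweets_list[[1]]            # after only the regex pre-cleaning
content(tweets_corpus[[1]]) # after the full tm pipeline
```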
Based on the word cloud below, you can see that people still can't stop talking about Lonzo Ball's official debut and the "welcome to the NBA" moment he got from Patrick Beverley. Haha, I think that was a good lesson for Lonzo, even if Patrick was a bit harsh. What's more, it is also easy to see that many people mentioned Kobe Bryant (salute to the Black Mamba), and Ingram attracted the spotlight as well.
```{r message=FALSE, warning=FALSE, echo=FALSE, fig.align='center'}
# tweets_dtm <- TermDocumentMatrix(tweets_corpus)
# tweets_dtm_matrix <- as.matrix(tweets_dtm)
# v <- sort(rowSums(tweets_dtm_matrix),decreasing=TRUE)
# tweets_dtm_dt <- data.table(word = names(v),freq=v)
# head(tweets_dtm_dt, 10)
# set.seed(1234)
# wordcloud(words = tweets_dtm_dt$word, freq = tweets_dtm_dt$freq, colors = brewer.pal(n = 5, name = "Dark2"), rot.per = .3, min.freq = 2, scale = c(4,1), random.color = F, max.words = 50, random.order = F)
# keeping the code above for reference; here the word cloud is drawn with wordcloud() directly
set.seed(123) # set a seed to keep the word cloud reproducible
wordcloud(tweets_corpus, colors = brewer.pal(n = 5, name = "Dark2"), rot.per = .3, min.freq = 2, scale = c(4,1), random.color = F, max.words = 50, random.order = F)
```
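If you prefer hard numbers to the cloud, tm's `findFreqTerms()` lists every term at or above a frequency threshold. A small aside complementing the commented-out frequency table above; the threshold of 5 is arbitrary:
```{r eval=FALSE}
# list all terms appearing at least 5 times (threshold is arbitrary)
findFreqTerms(TermDocumentMatrix(tweets_corpus), lowfreq = 5)
```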
However, still on this word cloud, you could interpret it in other ways. For example, since Lonzo put up a nice performance (29 points, 9 assists, and 10 rebounds) in his 2nd game, just one day after his debut, maybe his fans simply treat it as a fight back against Patrick :-)
Also, you can apply hierarchical clustering to the top words and try to get other insights.
```{r message=FALSE, warning=FALSE, echo=FALSE, fig.align='center'}
# create the term-document matrix
tweets_dtm <- TermDocumentMatrix(x = tweets_corpus)
tweets_dtm <- tweets_dtm %>% removeSparseTerms(sparse = 0.94) # remove highly sparse terms; kept as a separate assignment so the sparsity threshold can be tuned conveniently
set.seed(123)
# cluster words - based on hierarchical clustering
tweets_hclust <- tweets_dtm %>%
scale() %>% # scale the matrix
dist(method = "euclidean") %>% # calculate the distances among terms based on "euclidean" distance
hclust(method = "average")
# visualize hierarchical clustering results
par(mar=c(9,4,4,2))
draw_dendrogram(hclust = tweets_hclust, num_cluster = 3, main_title = "Word Clusters", border_type = 8, border_lin_type = 5, border_line_width = 2)
```
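If you want the actual cluster memberships rather than just the dendrogram, base R's `cutree()` cuts the same fitted tree into the same three groups. A small addition, not in the original code:
```{r eval=FALSE}
# extract the 3-cluster membership from the fitted tree (not in the original)
cutree(tweets_hclust, k = 3)
```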
In this way, you can indeed get some information, and a few conjectures, about the keyword you searched.
However, it is still a little difficult to know exactly how people's emotions or reactions relate to the topic you searched.
Next, I will try a sentiment calculation to answer the question above.
```{r message=FALSE, warning=FALSE}
# calculate sentiment scores based on J.Breen approach (https://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/)
scores <- calclulate_score(tweets_corpus, pos, neg)
```
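`calclulate_score()` itself lives in `funcs.r` and is not shown here. Conceptually, the J. Breen approach just counts how many words in each document match the positive and negative word lists; a minimal, hypothetical sketch (the real function may differ in its details) could look like this:
```{r eval=FALSE}
# hypothetical sketch of a Breen-style scorer; the real calclulate_score()
# in funcs.r may differ in its details
breen_score <- function(texts, pos_words, neg_words) {
  count_matches <- function(txt, dict) {
    words <- unlist(strsplit(tolower(txt), "\\s+")) # naive tokenization
    sum(words %in% dict)
  }
  data.frame(pos.score = sapply(texts, count_matches, dict = pos_words),
             neg.score = sapply(texts, count_matches, dict = neg_words))
}
```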
As you can see, the sentiment score is 0.39 (lower than 0.5), which means that when people mentioned the word 'lakers' on Twitter, most of them may have expressed somewhat negative emotions.
```{r message=FALSE, warning=FALSE, echo=FALSE}
global.score <- round(sum(scores$pos.score) / (sum(scores$pos.score) + sum(scores$neg.score)), digits = 2) # share of positive matches among all matches, in [0, 1]
df <- data.table(search_str = search_str, Sys_date = Sys.Date(), score = global.score)
df
```
However, keep in mind that the scores are calculated based on the dictionary you actually used. For example, you could apply the `nrc` dictionary here to see the sentiment scores.
```{r message=FALSE, warning=FALSE, fig.align='center'}
# flatten the cleaned corpus into one big character string
tweet_char_vec <- sapply(tweets_corpus, '[', "content") %>% unlist() %>% as.character() %>% paste(collapse = ' ')
# score it against the NRC emotion lexicon (syuzhet package)
tweet_sent <- get_nrc_sentiment(tweet_char_vec)
```
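`get_nrc_sentiment()` returns one column per NRC category (eight emotions plus `negative` and `positive`), so you can inspect the raw counts before plotting:
```{r eval=FALSE}
# peek at the raw NRC category counts before plotting
tweet_sent # one column per NRC category, one row per input string
```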
Then, you can make a plot like the one below to show the sentiment profile of your search keyword.
```{r message=FALSE, warning=FALSE, fig.align='center', echo=FALSE}
# reshape the NRC result into a (category, score) table for plotting;
# data.table does not reliably keep row names, so store the categories as a column
tweet_tb <- data.table(category = colnames(tweet_sent), sent_score = colSums(tweet_sent))
ggplot(tweet_tb, aes(x = reorder(category, sent_score), y = sent_score)) +
  geom_bar(stat = 'identity', fill = brewer.pal(n = 10, name = "Spectral")) +
  ggtitle(label = "Sentiment Scores", subtitle = "based on NRC dictionary") +
  labs(x = "category") +
  theme_bw()
```
Also, you can try other dictionaries such as "bing", "afinn", and "syuzhet", and calculate the sentiment scores with each to get a more comprehensive analysis of the sentiment of your corpus.
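For instance, syuzhet's `get_sentiment()` takes a `method` argument, so swapping dictionaries is a one-liner. A quick sketch; note that each method uses its own scale, so compare signs and trends rather than raw magnitudes:
```{r eval=FALSE}
# overall score under each dictionary; scales differ across methods,
# so compare signs/trends rather than raw magnitudes
sapply(c("syuzhet", "bing", "afinn", "nrc"),
       function(m) sum(get_sentiment(tweet_char_vec, method = m)))
```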
That's about the end of this demo; I hope you all enjoyed it!!