---
title: "Data Analysis R-Notebook"
output:
  html_document:
    df_print: paged
---
## Data Analysis of Simpsons Dialogue
*a.k.a: Who is the happiest Simpsons Character?*
This R Notebook shows how to create a labeled sentiment data set for the dialogue
from the Simpsons TV show, and then uses it to work out which of the Simpsons
characters is the happiest. It is more about showing the process than being as accurate as
possible, but going through the list at the end, the outcome seems reasonable.
The data set comes from Kaggle and you can see it here: https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons
It is quite small and has been included with the repo.
This project uses the `dplyr`, `sparklyr`, `ggplot2`, and `ggthemes` packages. There is a weird issue with CRAN and installing `sparklyr` at the moment, so if you install the packages in the following order, it will work.
```
install.packages("dplyr")
install.packages("tibble")
install.packages("sparklyr")
install.packages("ggplot2")
install.packages("ggthemes")
```
#### Load the libraries
```{r}
library(sparklyr)
library(dplyr)
library(ggplot2)
library(ggthemes)
```
#### Create the Spark connection
```{r}
config <- spark_config()
config$spark.driver.memory <- "8g"
# Setting spark.master in the config works around an issue with master='local' in the
# spark_connect function. If you're running on your local machine, you shouldn't need it.
config$spark.master <- "local"
config$`sparklyr.cores.local` <- 2
config$`sparklyr.shell.driver-memory` <- "8G"
config$spark.memory.fraction <- 0.9
# If you want to run in distributed mode, uncomment these lines and change master = 'local'
# to master = 'yarn-client', both here in the config and below in the spark_connect function.
#config$spark.executor.memory <- "8g"
#config$spark.executor.cores <- "2"
# If you're running this on CML in distributed mode, uncomment the following and make
# sure your project has environment variable named STORAGE that points to the right
# hive warehouse storage location.
# See: https://github.com/fastforwardlabs/cml_churn_demo_mlops/blob/master/0_bootstrap.py
#storage <- Sys.getenv("STORAGE")
#config$spark.yarn.access.hadoopFileSystems <- storage
sc <- spark_connect(master="local", config = config)
# Also for CML, the following will give you the URL to the Spark UI to open in a new Window
# paste("http://spark-",Sys.getenv("CDSW_ENGINE_ID"),".",Sys.getenv("CDSW_DOMAIN"),sep="")
```
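To sanity-check the new connection before going any further, `sparklyr` has a couple of helpers (optional; `spark_web` only opens a browser when running locally):
```
# Optional sanity checks on the new connection
spark_version(sc) # prints the Spark version the session is using
spark_web(sc)     # opens the Spark UI in a browser (when running locally)
```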
### Read in the CSV data
The text data is a very simple data set: two columns, one for the character and one for their
dialogue. Since we know it's two columns of type `character`, we set the schema ourselves and save
Spark the trouble of working it out with `infer_schema`.
Here is some sample data.
```
Miss Hoover,"No, actually, it was a little of both. Sometimes when a disease is in all he magazines and all the news shows, it's nly natural that you think you have it."
Lisa Simpson,Where's Mr. Bergstrom?
Miss Hoover,I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa Simpson,That life is worth living.
```
```{r}
cols <- list(
  raw_character_text = "character",
  spoken_words = "character"
)

spark_read_csv(
  sc,
  name = "simpsons_spark_table",
  path = "data/simpsons_dataset.csv",
  infer_schema = FALSE,
  columns = cols,
  header = TRUE
)
```
The other data set we will use is the AFINN word list: https://github.com/fnielsen/afinn/tree/master/afinn/data
It has two columns, the first being the word and the second an integer value for its valence, i.e. how
positive or negative the word is.
```
abandon -2
abandoned -2
abandons -2
abducted -2
```
```{r}
spark_read_csv(
  sc,
  name = "afinn_table",
  path = "data/AFINN-en-165.txt",
  infer_schema = TRUE,
  delimiter = "\t", # the AFINN file is tab-separated, as the sample above shows
  header = FALSE
)
```
#### Create local references for the Spark tables.
Once the Spark dataframes have been read in, we need local references to them to use with `dplyr`
verbs. You can also just assign the local reference directly at read time instead of calling `tbl()`
afterwards.
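That alternative would look like this for the Simpsons table:
```
# Capture the local reference in one step at read time
simpsons_spark_table <- spark_read_csv(
  sc,
  name = "simpsons_spark_table",
  path = "data/simpsons_dataset.csv",
  infer_schema = FALSE,
  columns = cols,
  header = TRUE
)
```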
```{r}
afinn_table <- tbl(sc, "afinn_table")
afinn_table <- afinn_table %>% rename(word = V1, value = V2)
simpsons_spark_table <- tbl(sc, "simpsons_spark_table")
as.data.frame(head(simpsons_spark_table))
```
```{r}
simpsons_spark_table %>% count()
```
This shows us there are 158,314 lines of dialogue in the text corpus.
#### Basic Data Cleaning
The first thing to do is rename a column so it's easier to type out as we go. It's not a huge
improvement really, but it makes me happy, so there.
```{r}
simpsons_spark_table <-
  simpsons_spark_table %>%
  rename(raw_char = raw_character_text)
```
Dropping null / NA values
```{r}
simpsons_spark_table <-
  simpsons_spark_table %>%
  na.omit()
```
#### Top Speakers
This table shows the number of lines spoken by each character.
```{r}
simpsons_spark_table %>% group_by(raw_char) %>% count() %>% arrange(desc(n))
```
## Text Mining
In this section, we need to start processing the text to get it into a numeric format that a classifier
model can use. There is a lot of good detail on how to do that using sparklyr [here](https://spark.rstudio.com/guides/textmining/) and this section uses a lot of those functions.
#### Remove Punctuation
Punctuation is removed using the Hive UDF `regexp_replace`.
```{r}
simpsons_spark_table <-
  simpsons_spark_table %>%
  mutate(spoken_words = regexp_replace(spoken_words, "\'", "")) %>%
  mutate(spoken_words = regexp_replace(spoken_words, "[_\"():;,.!?\\-]", " "))
```
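Using one of the sample lines from earlier, the two substitutions behave roughly like this (a sketch, not output from the code):
```
# Before: "Where's Mr. Bergstrom?"
# Step 1 (apostrophes dropped):           "Wheres Mr. Bergstrom?"
# Step 2 (punctuation becomes spaces):    "Wheres Mr  Bergstrom "
```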
#### Tokenize
Tokenizing separates the sentences into a list of individual words.
```{r}
simpsons_spark_table <-
  simpsons_spark_table %>%
  ft_tokenizer(input_col = "spoken_words", output_col = "word_list")
```
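`ft_tokenizer` also lowercases as it splits on whitespace, so continuing the same example:
```
# "Wheres Mr  Bergstrom "  ->  ["wheres", "mr", "bergstrom"]
```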
#### Remove Stop Words
Remove the smaller, commonly used words that don't add relevance for sentiment like "and" and "the".
```{r}
simpsons_spark_table <-
  simpsons_spark_table %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
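Using another sample line, Spark's default English stop-word list drops words like "that" and "is":
```
# ["that", "life", "is", "worth", "living"]  ->  ["life", "worth", "living"]
```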
#### Write to Hive (optional)
We might need both of these dataframes for use later, so in CML you can write both tables to
the default Hive database.
```{r}
#spark_write_table(simpsons_spark_table,"simpsons_spark_table",mode = "overwrite")
#spark_write_table(afinn_table,"afinn_table",mode = "overwrite")
```
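If you do persist them, a later session can pick the tables back up with the same `tbl()` pattern used earlier:
```
# Re-open the persisted tables in a later session
simpsons_spark_table <- tbl(sc, "simpsons_spark_table")
afinn_table <- tbl(sc, "afinn_table")
```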
## So Who is the Happiest Simpson?
This part is where there could be some debate about the best approach. Ideally, each of the 158,000-odd
lines of dialogue would be labeled by an actual human, in the context of the show, to get a valid
data set. But who has the time or money for that? So let's take a different approach. First we
take each of the tokenized sentences and use the Hive UDF `explode` to put each word into its own
row, still keeping its associated line of dialogue.
(_Note:_ We only keep words of more than 2 characters.)
```{r}
sentences <- simpsons_spark_table %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(spoken_words, word) %>%
  filter(nchar(word) > 2) %>%
  compute("simpsons_spark_table")

sentences
```
Then we use the AFINN table to assign a numeric value to each word that has a corresponding value
in the table using an `inner_join`. These values are grouped back together and summed to
give each line of dialogue an integer value.
```{r}
sentence_values <- sentences %>%
  inner_join(afinn_table, by = "word") %>%
  group_by(spoken_words) %>%
  summarise(weighted_sum = sum(value))

sentence_values
```
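As a worked (and entirely hypothetical) example using the AFINN sample rows above, a line whose surviving tokens were "abandoned" and "abducted" would score:
```
# abandoned (-2) + abducted (-2)  ->  weighted_sum = -4
```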
And below you can see how these integer values are distributed.
```{r}
weighted_sum_summary <- sentence_values %>% sdf_describe(cols="weighted_sum")
weighted_sum_summary
```
```{r}
density_plot <- function(X) {
  # prob = TRUE plots probability densities rather than counts
  hist(X, prob = TRUE, col = "grey", breaks = 500, xlim = c(-10, 10), ylim = c(0, 0.2))
  # add a kernel density estimate with default settings
  lines(density(X), col = "blue", lwd = 2)
}

density_plot(as.data.frame(sentence_values %>% select(weighted_sum))$weighted_sum)
```
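Since `ggplot2` is already loaded, a roughly equivalent plot can be sketched with it by collecting the weighted sums locally first (a sketch, assuming ggplot2 >= 3.3 for `after_stat`):
```
ws <- sentence_values %>% select(weighted_sum) %>% collect()
ggplot(ws, aes(weighted_sum)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "grey") +
  geom_density(colour = "blue") +
  xlim(-10, 10)
```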
What is interesting about the above graph is how the distribution of the weighted_sum calculation is
not normal but more bimodal. This lends itself to a Positive / Negative sentiment classification.
#### Calculate Sentiment Values by Character
This is repeating some of the steps above, but restricting it to the top 30 characters by number of lines of dialogue.
```{r}
simpsons_final_words <- simpsons_spark_table %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, raw_char) %>%
  filter(nchar(word) > 2) %>%
  compute("simpsons_spark_table")

top_chars <- simpsons_final_words %>%
  group_by(raw_char) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(30) %>%
  as.data.frame()

top_chars <- as.vector(top_chars$raw_char)
```
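`top_chars` is now a plain character vector held locally in R, which is why it can be used with `%in%` inside the Spark filter below; conceptually it looks something like this (names illustrative):
```
# c("Homer Simpson", "Marge Simpson", "Bart Simpson", "Lisa Simpson", ...)
```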
## Sentiment Analysis
This repeats the AFINN word list `inner_join` process, but this time for words by character
in the top 30 list. Also, the summed AFINN value is weighted by the number of words each character
speaks, rather than using just the raw sum, which would push the characters who say the most to the top.
```{r}
happiest_characters <- simpsons_final_words %>%
  filter(raw_char %in% top_chars) %>%
  inner_join(afinn_table, by = "word") %>%
  group_by(raw_char) %>%
  summarise(weighted_sum = sum(value) / n()) %>%
  arrange(desc(weighted_sum)) %>%
  as.data.frame()
```
```{r}
happiest_characters
```
And here we have the results. The happiest character is Lenny Leonard. I was surprised it wasn't
Ned, but he's up there, and this project could be done using different approaches that
could have him higher up the list.
```{r}
p <-
  ggplot(happiest_characters, aes(reorder(raw_char, weighted_sum), weighted_sum)) +
  theme_tufte(base_size = 14, ticks = FALSE) +
  geom_col(width = 0.75, fill = "grey") +
  theme(axis.title = element_blank(), text = element_text(size = 14, family = "sans")) +
  coord_flip()
p
```
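If you want to keep the chart as an image, `ggsave` works on the `p` object (the path and size here are just examples):
```
# Save the chart to disk (hypothetical path and size)
ggsave("happiest_characters.png", p, width = 8, height = 6)
```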
We should have known really.
![Horray for Lenny](images/lenny.jpg)