Note: This project is derived from a Cloudera Machine Learning (CML) Applied Machine Learning Project Prototype. It is less CML-specific and drops the Python parts; nearly everything is now done with sparklyr.
Images Source: https://toddwschneider.com/posts/the-simpsons-by-the-data/
Ever wondered who the happiest Simpsons character is and what you can do once armed with that knowledge? This project is for you!
Note: There is a lot of supporting info and documentation in each of the project files, so it's worth taking a look through the files.
This project has two main parts:
- Data Analysis
- Model Deployment
1_Data_Analysis.Rmd and 2_Sentiment_Score_Creator.R
These two files create a labelled data set that assigns a sentiment score to each line of dialogue from The Simpsons. The dialogue data comes from https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons.
1_Data_Analysis.Rmd is the full process, including an analysis of this data to calculate
which Simpsons character has the highest "happiness" score across their dialogue from the show.
2_Sentiment_Score_Creator.R recreates the initial steps to label the data and saves it for use
in Part 2. If you knit the 1_Data_Analysis.Rmd file, it will show the process for the
analysis and include the results in the final .html file.
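As a rough illustration of the "happiness" calculation described above, the sketch below averages per-line sentiment scores by character and ranks the result. The data frame, its column names (character, sentiment), and the scores are made up for the example. The project itself does this step with sparklyr on the full data set, so treat this base R version as an assumption-laden outline, not the code in 1_Data_Analysis.Rmd.

```r
# Toy stand-in for the labelled data set: one row per line of dialogue.
# Column names and scores are hypothetical, chosen only for illustration.
scored_lines <- data.frame(
  character = c("Homer", "Homer", "Marge", "Marge", "Bart"),
  sentiment = c(0.8, -0.2, 0.5, 0.3, -0.4),
  stringsAsFactors = FALSE
)

# Average the sentiment score over all lines spoken by each character.
happiness <- aggregate(sentiment ~ character, data = scored_lines, FUN = mean)

# Sort so the character with the highest average score comes first.
happiness <- happiness[order(-happiness$sentiment), ]

# The "happiest" character in this toy data set.
happiest <- happiness$character[1]
print(happiness)
print(happiest)
```

A mean is only one way to define a happiness score; characters with very few lines can dominate the ranking, so a minimum line count may be worth applying.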
The 3_Word2vec_Model.R and 4_Bert_Model.R files take the labelled data and build two
different sentiment classifier models, one using Word2vec and the other using BERT for the
word embeddings. Both are Spark-based models. Word2vec uses standard Spark ML
functionality, while the BERT model uses the SparkNLP port for sparklyr from r-spark.
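To give a feel for what a Word2vec-based classifier looks like with standard Spark ML functionality in sparklyr, here is a minimal sketch. It assumes a local Spark installation, a toy data set with hypothetical text and label columns, and logistic regression as the classifier; the actual stages and parameters in 3_Word2vec_Model.R may differ.

```r
library(sparklyr)
library(dplyr)

# Assumes Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Toy labelled data; the text and label values are invented for illustration.
lines_df <- data.frame(
  text = c("woo hoo this is great", "i love donuts",
           "this is the worst day ever", "i hate this so much"),
  label = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)
lines_tbl <- copy_to(sc, lines_df, "simpsons_lines", overwrite = TRUE)

# Tokenize the text, embed the tokens with Word2vec, then classify.
# The choice of logistic regression is an assumption, not the project's method.
pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_word2vec(input_col = "words", output_col = "features",
              vector_size = 16, min_count = 1) %>%
  ml_logistic_regression(features_col = "features", label_col = "label")

# Fit the whole pipeline and score the same toy rows.
model <- ml_fit(pipeline, lines_tbl)
result <- ml_transform(model, lines_tbl) %>%
  select(text, label, prediction) %>%
  collect()

print(result)
spark_disconnect(sc)
```

Keeping the tokenizer, embedding, and classifier in one pipeline means the fitted model can be saved and reloaded as a single object, which is convenient when another script needs to score new sentences.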
The 5_Shiny_App.R file runs a Shiny application that loads both models and predicts the
sentiment of a new sentence.