Our project Real Time Twitter Sentiment Analysis, revolves around the idea of using unsuopervised Machine Learning approaches to classify the twitter data(tweets) into sentiment categories of POSITIVE, NEGATIVE or NEUTRAL.
- Analysis of Tweets from Twitter Usernames and Keywords.
- Classification of Tweets based on their sentiments in real-time.
- Interactive Charts and Graphs visualizing the corresponding twitter engagement.
- Options to choose custom input attributes like range of dates, maximum number of tweets to be fetched, etc.
- Dashboard presenting a complete twitter-performance-chart for the respective Username or keyword.
- Analysis of user engagement on the Twitter, based on different languages used, number of retweets and distribution of tweets over weekdays.
-
Twint package is used for fetching tweets from Twitter in realtime.
-
Training the Sentiment Model:
- NLTK provides several modules for data-preprocessing and Natural Language Processing in Python.
- Preprocessing utilities from NTLK like stopwords, porter stemmer were used during the Text preprocessing stage in preparing the training dataset to be fed into the model.
- Twitter Sentiment Dataset from Kaggle is used for gathering data to train the sentiment-model.
- ScikitLearn provides useful model libraries.
- SkLeanr's TfIdf Vectorizer was used for preparing the embedded matrix.
- Followed by it, K-Means Clustering model is used to cluster the semantically similar words from the embedded matrix and derive the cluster centers of three different sentiments.
- Gensim provides fast utilites for training NLP models and vector embeddings.
- Word2Vec model from gensim was used for vector embeddings.
- Pickle was used for serializing trained models and using them for prediction and production. The trained models were pickled and dumped in the directory for further use.
- NLTK provides several modules for data-preprocessing and Natural Language Processing in Python.
-
Dashboard for Twitter Analysis:
-
Determining the accuracy of the Sentiment Analysis Model: For determining the accuracy, a dataset was choosen and its polarity was determined using pretrained Supervised ML model Vader Sentiment Analyser and then the F1 score was calculated using both the labelled data and the predicted data.
- The accuracy of the model stands at:
75.2%
- The accuracy of the model stands at:
Using a Twitter-Username for Analysing data
Using a Keyword for Analysing data
The project has several use cases in the industry ranging from, Analysing the sentiment of Users on Twitter for a particular product or service, to managing and proctoring the twitter engagement for tweets related a particular topic. The dashboard can act as a perfect tool for analysing market performance and further deciding the future of the service or product offered.
For setting up the project on a local machine
-
Fork this repository.
-
Clone the repository using simple zip download or use the command
git clone https://github.com/gautamanirudh/twitterdash.git
-
Move to the master branch by using command
git checkout master
-
Create a virtual environment for the project
pip install virtualenv virtualenv -p /usr/bin/python3 env_name
-
Activate the Virtual environment
source env_name/bin/activate
Once the virtual environment is activated, the name of your virtual environment will appear on left side of terminal. This will let you know that the virtual environment is currently active.
-
Install all the dependencies
pip install -r requirements.txt
-
To start the Dashboard app, run the command
python app.py
Above Steps are sufficient for running the dashboard and analyzing realtime twitter data sentiment performance. But, for running the preprocessing and training model files, nltk data has to be downloaded to access the utilities. For that use the command:
```
nltk.download()
```
** Userful Resources