Merge pull request #977 from adwityac/adwityac
Add Hate Speech Detection
abhisheks008 authored Nov 10, 2024
2 parents f2e0150 + 85a9652 commit 4e88e95
Showing 10 changed files with 28,764 additions and 0 deletions.

1,565 changes: 1,565 additions & 0 deletions Hate Speech Detection/Model/Hate_Speech_Detection_using_Deep_Learning.ipynb

Large diffs are not rendered by default.

60 changes: 60 additions & 0 deletions Hate Speech Detection/Model/readme.md
@@ -0,0 +1,60 @@
## **Hate Speech Detection**

### 🎯 **Goal**

The goal of this project is to develop a deep learning model that identifies and classifies hate speech in text, so that harmful language can be filtered and online interactions made safer and more respectful.

### 🧵 **Dataset**

The dataset is the CrowdFlower hate speech identification dataset, hosted on data.world: https://data.world/crowdflower/hate-speech-identification

### 🧾 **Description**

This project focuses on detecting hate speech in text using deep learning techniques. It involves preprocessing text data, training a neural network model, and evaluating its performance in classifying content as either hate speech or non-hate speech. The model aims to enhance online content moderation by identifying harmful language effectively, contributing to safer digital spaces.

### 🧮 **What I have done!**

1. **Data Loading**: Import the labeled text dataset.
2. **Preprocessing**: Clean text by removing noise, tokenizing, and normalizing.
3. **EDA**: Analyze class distribution and visualize data patterns.
4. **Model Building**: Create a neural network with embedding and LSTM layers.
5. **Training**: Train the model with a split of training and validation data.
6. **Evaluation**: Assess performance using metrics like accuracy and F1-score.
7. **Visualization**: Plot accuracy and loss to check model performance.
8. **Prediction**: Use the model to classify new text as hate speech or non-hate speech.
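
The preprocessing step listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the `clean_text` helper, the regular expressions, and the tiny stopword set are assumptions made for the example (the project lists NLTK for tokenization and stopword removal).

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stopword set for illustration only;
# the project lists NLTK for real stopword removal.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+")                # @user mentions
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> List[str]:
    """Lowercase, strip URLs, mentions, punctuation and digits, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]
```

The resulting token lists would then be mapped to integer ids and padded to a fixed length before being fed to the embedding layer.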

### 🚀 **Models Implemented**

The project uses an LSTM (Long Short-Term Memory) model with an embedding layer to detect hate speech. LSTM was chosen because it effectively captures the context and long-term dependencies in sequential text data, making it well-suited for understanding language patterns. The embedding layer helps convert words into dense vectors, enhancing the model's ability to grasp semantic relationships, while a final dense layer with a sigmoid activation performs binary classification of the text.
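
To make the last step concrete, the sketch below shows in plain Python how a dense layer with a sigmoid activation turns a feature vector into a probability. The feature vector stands in for the LSTM's final hidden state, and the weights, bias, and feature values are made-up numbers for illustration, not parameters of the trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(features, weights, bias):
    """Single-unit dense layer: weighted sum plus bias, then sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)


# Illustrative values only (assumptions, not taken from the project).
hidden_state = [0.2, -0.5, 0.9]   # stand-in for the LSTM output
weights = [1.5, -2.0, 0.5]
bias = -0.1

probability = dense_sigmoid(hidden_state, weights, bias)
```

A probability close to 1 is read as the positive class (hate speech) and a probability close to 0 as the negative class.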

### 📚 **Libraries Needed**

Here are all the libraries used in this project:

1. **NumPy**: For numerical operations and array handling.
2. **Pandas**: For data manipulation and analysis.
3. **Matplotlib**: For creating visualizations and plots.
4. **Seaborn**: For statistical data visualization.
5. **NLTK (Natural Language Toolkit)**: For text preprocessing tasks like tokenization and stopword removal.
6. **Scikit-learn**: For data splitting, metrics evaluation, and preprocessing utilities.
7. **TensorFlow/Keras**: For building and training the deep learning model.
8. **re (Regular Expressions)**: Standard-library module for text cleaning and preprocessing.
9. **string**: Standard-library module for handling text processing tasks.

### 📊 **Exploratory Data Analysis Results**
![model_deployment_01](https://github.com/user-attachments/assets/1c8cb248-9ff1-4dd3-af0f-f00e080854f9)
![model_deployment_02](https://github.com/user-attachments/assets/341dab93-3293-4f2e-9a8f-1464a2b4a57a)


### 📈 **Performance of the Models based on the Accuracy Scores**

The project used an **LSTM (Long Short-Term Memory) Network** as the main algorithm. It achieved an accuracy of approximately **85%** on the test dataset. The results indicated a strong performance in detecting hate speech, with balanced precision, recall, and F1-score, showcasing its effectiveness in handling complex and context-dependent text data.
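
For reference, accuracy, precision, recall, and F1-score for a binary classifier can be computed from predictions as in the sketch below. `binary_metrics` is a hypothetical helper written for this example (the project lists scikit-learn for metrics), and the labels in the usage are toy values, not the project's results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = hate speech, 0 = non-hate speech).
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Reporting precision and recall alongside accuracy matters here because hate speech datasets are often imbalanced, so accuracy alone can look good even when the minority class is poorly detected.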


### 📢 **Conclusion**

Differentiating hate speech from offensive language is a challenging task. Our approach, which involves text pre-processing and feature extraction (e.g., n-gram tf-idf, sentiment polarity, doc2vec, and readability scores), demonstrates the benefits of using these features for classification. The evaluation of models based on accuracy and F1-scores highlights the complexity of the problem. While the results show the potential of the proposed features, further analysis and error review could improve feature extraction methods and help address existing challenges in detecting toxic language on platforms like Twitter.

### ✒️ **Your Signature**

Adwitya Chakraborty
83 changes: 83 additions & 0 deletions Hate Speech Detection/Web App/app.py
@@ -0,0 +1,83 @@
from flask import Flask, render_template, request, jsonify
from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired
import tensorflow as tf
import tensorflow_text # prerequisite for using the BERT preprocessing layer
import numpy as np
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Create the Flask web application
app = Flask(__name__)

# Set a secret key (stored in .env) as a security measure (e.g. protecting against CSRF attacks)
app.config["SECRET_KEY"] = os.getenv("SECRET_KEY")

# Load the TensorFlow model
model = tf.keras.models.load_model("saved_models/model3")


# Create the hate speech detection form class (inherits from Flask-WTF's FlaskForm)
class HateSpeechForm(FlaskForm):
    comment = StringField("Social Media Comment", validators=[DataRequired()])
    submit = SubmitField("Run")


# Home route
@app.route("/", methods=["GET", "POST"])
def home():
    # Instantiate a hate speech form class object
    form = HateSpeechForm()
    # If the user submitted valid information in the hate speech form
    if form.validate_on_submit():
        # Get the input text from the form
        input_text = form.comment.data
        # Convert input text to a list
        input_data = [input_text]
        # Make prediction using the TensorFlow model
        prediction_prob = model.predict(input_data)[0][0]
        # Convert prediction probability to percent
        prediction_prob = np.round(prediction_prob * 100, 1)
        # Convert prediction probability to prediction in text form
        if prediction_prob >= 50:
            prediction = "Hate Speech"
        else:
            prediction = "No Hate Speech"
            # Invert the probability so it refers to the predicted class,
            # rounding again to avoid floating-point noise in the output
            prediction_prob = np.round(100 - prediction_prob, 1)
        # Render the prediction and prediction probability in the index.html template
        return render_template("index.html",
                               form=form,
                               prediction=prediction,
                               prediction_prob=prediction_prob)
    return render_template("index.html", form=form)


# API route
@app.route("/api")
def prediction_by_api():
    # Get the input text from the api query parameter
    input_text = request.args.get("comment")
    # Reject requests without a comment, since the model cannot score a missing value
    if not input_text:
        return jsonify({"error": "Missing required query parameter: comment"}), 400
    # Convert input text to a list
    input_data = [input_text]
    # Make prediction using the TensorFlow model
    prediction_prob = model.predict(input_data)[0][0]
    # Convert prediction probability to prediction in text form
    if prediction_prob >= 0.5:
        prediction = "Hate Speech"
    else:
        prediction = "No Hate Speech"
        # Invert the probability so it refers to the predicted class
        prediction_prob = 1 - prediction_prob
    # Return json with the prediction and prediction probability
    return jsonify({"prediction": prediction,
                    "probability": float(prediction_prob)})


# Start the Flask web application (debug mode is meant for local development only)
if __name__ == "__main__":
    app.run(debug=True)
14 changes: 14 additions & 0 deletions Hate Speech Detection/Web App/readme.md
@@ -0,0 +1,14 @@
## Hate Speech Detection

### Goal 🎯
The goal of this project is to develop a deep learning model that identifies and classifies hate speech in text, so that harmful language can be filtered and online interactions made safer and more respectful.

### Model(s) used for the Web App 🧮
The application uses a TensorFlow model with BERT for binary classification of hate speech vs. non-hate speech. The model outputs a probability between 0 and 1, and 0.5 is used as the decision threshold.
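
The thresholding described above might look like the following sketch. It mirrors the logic in `app.py` (compare the model output to 0.5 and report the probability of the predicted class), but `label_from_probability` is a hypothetical helper name introduced for this example, not a function in the app.

```python
def label_from_probability(prob: float, threshold: float = 0.5):
    """Map a model output in [0, 1] to a label and the predicted class's probability."""
    if prob >= threshold:
        return "Hate Speech", prob
    # For the negative class, report 1 - prob so the number always
    # describes confidence in the label that is shown.
    return "No Hate Speech", 1.0 - prob
```

For example, a model output of 0.2 would be reported as "No Hate Speech" with a probability of about 0.8.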

### Video Demonstration 🎥
![model_deployment_api](https://github.com/user-attachments/assets/e89599e4-8271-4c65-aefd-17078c1fc9c9)


### Signature ✒️
Adwitya Chakraborty
81 changes: 81 additions & 0 deletions Hate Speech Detection/Web App/requirements_deployment.txt
@@ -0,0 +1,81 @@
absl-py==1.4.0
asttokens==2.2.1
astunparse==1.6.3
backcall==0.2.0
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.2.0
click==8.1.4
cloudpickle==2.2.1
colorama==0.4.6
comm==0.1.3
debugpy==1.6.7
decorator==5.1.1
executing==1.2.0
Flask==1.1.2
Flask-WTF==0.14.3
flatbuffers==23.5.26
gast==0.4.0
google-auth==2.22.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.56.0
gunicorn==20.1.0
h5py==3.8.0
idna==3.4
importlib-metadata==6.7.0
ipykernel==6.16.2
ipython==7.34.0
itsdangerous==2.0.1
jedi==0.18.2
Jinja2==3.0.0
jupyter_client==8.0.0a1
jupyter_core==5.0.0rc2
keras==2.10.0
Keras-Preprocessing==1.1.2
libclang==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.21.6
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==23.1
parso==0.8.3
pickleshare==0.7.5
platformdirs==3.8.1
prompt-toolkit==3.0.39
protobuf==3.19.6
psutil==5.9.5
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.15.1
python-dateutil==2.8.2
python-dotenv==0.21.1
pyzmq==25.1.0
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
spyder-kernels==2.2.0
stack-data==0.6.2
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.31.0
tensorflow-text==2.10.0
termcolor==2.3.0
tornado==6.2
traitlets==5.9.0
typing_extensions==4.7.1
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.0.3
wrapt==1.15.0
WTForms==2.3.3
zipp==3.15.0
7 changes: 7 additions & 0 deletions Hate Speech Detection/requirements.txt
@@ -0,0 +1,7 @@
numpy
pandas
matplotlib
seaborn
scikit-learn
nltk
tensorflow
