Merge pull request #977 from adwityac/adwityac
Add Hate Speech Detection
Showing 10 changed files with 28,764 additions and 0 deletions.
`Hate Speech Detection/Dataset/Dataset---Hate-Speech-Detection-using-Deep-Learning.csv`: 26,954 additions, 0 deletions (large diff not rendered by default)
`Hate Speech Detection/Model/Hate_Speech_Detection_using_Deep_Learning.ipynb`: 1,565 additions, 0 deletions (large diff not rendered by default)
@@ -0,0 +1,60 @@
## **Hate Speech Detection**

### 🎯 **Goal**

The main goal of the project was to develop a deep learning model that accurately identifies and classifies hate speech in text data, helping to filter harmful language and promote safer, more respectful online interactions.

### 🧵 **Dataset**

The dataset is taken from CrowdFlower: https://data.world/crowdflower/hate-speech-identification

### 🧾 **Description**

This project focuses on detecting hate speech in text using deep learning techniques. It involves preprocessing text data, training a neural network model, and evaluating its performance in classifying content as either hate speech or non-hate speech. The model aims to enhance online content moderation by identifying harmful language effectively, contributing to safer digital spaces.

### 🧮 **What I had done!**

1. **Data Loading**: Import the labeled text dataset.
2. **Preprocessing**: Clean the text by removing noise, tokenizing, and normalizing (a sketch of these steps follows this list).
3. **EDA**: Analyze the class distribution and visualize data patterns.
4. **Model Building**: Create a neural network with embedding and LSTM layers.
5. **Training**: Train the model on a training/validation split.
6. **Evaluation**: Assess performance using metrics such as accuracy and F1-score.
7. **Visualization**: Plot accuracy and loss curves to check model performance.
8. **Prediction**: Use the model to classify new text as hate speech or non-hate speech.

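The cleaning and tokenization in steps 2 and 4 might look roughly like the sketch below, which uses the Keras `Tokenizer`/`pad_sequences` utilities and NLTK stopwords; the vocabulary size, sequence length, and column selection are illustrative assumptions, not values taken from the notebook.

```python
# Illustrative preprocessing sketch; vocab size, sequence length, and the
# text column are assumptions, not values from the actual notebook.
import re
import string

import pandas as pd
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

STOP_WORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")

def clean_text(text: str) -> str:
    """Lowercase, strip URLs/mentions/hashtags and punctuation, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

df = pd.read_csv("Hate Speech Detection/Dataset/Dataset---Hate-Speech-Detection-using-Deep-Learning.csv")
texts = df.iloc[:, 0].astype(str).map(clean_text)  # adjust the column to the real schema

MAX_WORDS, MAX_LEN = 10_000, 100                   # assumed hyperparameters
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
```
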
### 🚀 **Models Implemented**

The project uses an LSTM (Long Short-Term Memory) model with an embedding layer to detect hate speech. LSTM was chosen because it effectively captures the context and long-term dependencies in sequential text data, making it well-suited for understanding language patterns. The embedding layer helps convert words into dense vectors, enhancing the model's ability to grasp semantic relationships, while a final dense layer with a sigmoid activation performs binary classification of the text.

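A minimal Keras sketch of this architecture (embedding, LSTM, sigmoid dense output) could look like the following; the layer sizes, dropout rate, and optimizer are assumptions for illustration, not the exact configuration from the notebook.

```python
# Illustrative architecture only; layer sizes and hyperparameters are assumed.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

MAX_WORDS, MAX_LEN = 10_000, 100  # same assumed values as in the preprocessing sketch

model = Sequential([
    Input(shape=(MAX_LEN,)),         # padded token-id sequences
    Embedding(MAX_WORDS, 128),       # dense word vectors
    LSTM(64),                        # captures long-range context in the sequence
    Dropout(0.3),                    # light regularization
    Dense(1, activation="sigmoid"),  # binary hate / non-hate output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```
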
### 📚 **Libraries Needed**

Here are all the libraries used in this project:

1. **NumPy**: For numerical operations and array handling.
2. **Pandas**: For data manipulation and analysis.
3. **Matplotlib**: For creating visualizations and plots.
4. **Seaborn**: For statistical data visualization.
5. **NLTK (Natural Language Toolkit)**: For text preprocessing tasks like tokenization and stopword removal.
6. **Scikit-learn**: For data splitting, metrics evaluation, and preprocessing utilities.
7. **TensorFlow/Keras**: For building and training the deep learning model.
8. **re (Regular Expressions)**: For text cleaning and preprocessing.
9. **string**: For handling text processing tasks.

### 📊 **Exploratory Data Analysis Results**

![model_deployment_01](https://github.com/user-attachments/assets/1c8cb248-9ff1-4dd3-af0f-f00e080854f9)
![model_deployment_02](https://github.com/user-attachments/assets/341dab93-3293-4f2e-9a8f-1464a2b4a57a)

### 📈 **Performance of the Models based on the Accuracy Scores**

The project used an **LSTM (Long Short-Term Memory) Network** as the main algorithm. It achieved an accuracy of approximately **85%** on the test dataset. The results indicated a strong performance in detecting hate speech, with balanced precision, recall, and F1-score, showcasing its effectiveness in handling complex and context-dependent text data.

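For reference, metrics like these are typically computed with scikit-learn as sketched below; the snippet assumes the trained `model` and a held-out `X_test`/`y_test` split (with label 1 meaning hate speech) from the earlier sketches, plus the conventional 0.5 decision threshold.

```python
# Illustrative evaluation only; assumes `model`, `X_test`, and `y_test`
# from a prior train/test split, with 1 = hate speech and a 0.5 threshold.
from sklearn.metrics import accuracy_score, classification_report

y_prob = model.predict(X_test).ravel()   # sigmoid outputs in [0, 1]
y_pred = (y_prob >= 0.5).astype(int)     # binarize predictions

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["non-hate", "hate"]))
```
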
### 📢 **Conclusion**

Differentiating hate speech from offensive language is a challenging task. Our approach, which involves text pre-processing and feature extraction (e.g., n-gram tf-idf, sentiment polarity, doc2vec, and readability scores), demonstrates the benefits of using these features for classification. The evaluation of models based on accuracy and F1-scores highlights the complexity of the problem. While the results show the potential of the proposed features, further analysis and error review could improve feature extraction methods and help address existing challenges in detecting toxic language on platforms like Twitter.

### ✒️ **Your Signature**

Adwitya Chakraborty

@@ -0,0 +1,83 @@

```python
from flask import Flask, render_template, request, jsonify
from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired
import tensorflow as tf
import tensorflow_text  # prerequisite for using the BERT preprocessing layer
import numpy as np
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Create the Flask web application
app = Flask(__name__)

# Set a secret key (stored in .env) as a security measure (e.g. protecting against CSRF attacks)
app.config["SECRET_KEY"] = os.getenv("SECRET_KEY")

# Load the TensorFlow model
model = tf.keras.models.load_model("saved_models/model3")


# Create hate speech detection form class (that inherits from the Flask WTForm class)
class HateSpeechForm(FlaskForm):
    comment = StringField("Social Media Comment", validators=[DataRequired()])
    submit = SubmitField("Run")


# Home route
@app.route("/", methods=["GET", "POST"])
def home():
    # Instantiate a hate speech form class object
    form = HateSpeechForm()
    # If the user submitted valid information in the hate speech form
    if form.validate_on_submit():
        # Get the input text from the form
        input_text = form.comment.data
        # Convert input text to a list
        input_data = [input_text]
        # Make prediction using the TensorFlow model
        prediction_prob = model.predict(input_data)[0][0]
        # Convert prediction probability to percent
        prediction_prob = np.round(prediction_prob * 100, 1)
        # Convert prediction probability to prediction in text form
        if prediction_prob >= 50:
            prediction = "Hate Speech"
        else:
            prediction = "No Hate Speech"
        # Invert the prediction probability
        prediction_prob = 100 - prediction_prob
        # Render the prediction and prediction probability in the index.html template
        return render_template("index.html",
                               form=form,
                               prediction=prediction,
                               prediction_prob=prediction_prob)
    return render_template("index.html", form=form)


# API route
@app.route("/api")
def prediction_by_api():
    # Get the input text from the api query parameter
    input_text = request.args.get("comment")
    # Convert input text to a list
    input_data = [input_text]
    # Make prediction using the TensorFlow model
    prediction_prob = model.predict(input_data)[0][0]
    # Convert prediction probability to prediction in text form
    if prediction_prob >= 0.5:
        prediction = "Hate Speech"
    else:
        prediction = "No Hate Speech"
    # Invert the prediction probability
    prediction_prob = 1 - prediction_prob
    # Return json with the prediction and prediction probability
    return jsonify({"prediction": prediction,
                    "probability": float(prediction_prob)})


# Start the Flask web application
if __name__ == "__main__":
    app.run(debug=True)
```

@@ -0,0 +1,14 @@
## Hate Speech Detection

### Goal 🎯
The main goal of the project was to develop a deep learning model that accurately identifies and classifies hate speech in text data, helping to filter harmful language and promote safer, more respectful online interactions.

### Model(s) used for the Web App 🧮
The application uses a TensorFlow model with BERT for binary classification of hate speech vs. non-hate speech. The model produces probabilities between 0 and 1, with 0.5 as the decision threshold.

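As a usage illustration, the `/api` route defined in the Flask app above can be called with a `comment` query parameter; the host and port below assume a local development run (`python app.py`) and are not part of the repository.

```python
# Example call to the /api route; host/port assume a local `python app.py` run.
import requests

resp = requests.get(
    "http://127.0.0.1:5000/api",
    params={"comment": "I really enjoyed this video, great work!"},
)
print(resp.json())  # {"prediction": "...", "probability": ...}
```
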
### Video Demonstration 🎥
![model_deployment_api](https://github.com/user-attachments/assets/e89599e4-8271-4c65-aefd-17078c1fc9c9)

### Signature ✒️
Adwitya Chakraborty

@@ -0,0 +1,81 @@
absl-py==1.4.0
asttokens==2.2.1
astunparse==1.6.3
backcall==0.2.0
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.2.0
click==8.1.4
cloudpickle==2.2.1
colorama==0.4.6
comm==0.1.3
debugpy==1.6.7
decorator==5.1.1
executing==1.2.0
Flask==1.1.2
Flask-WTF==0.14.3
flatbuffers==23.5.26
gast==0.4.0
google-auth==2.22.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.56.0
gunicorn==20.1.0
h5py==3.8.0
idna==3.4
importlib-metadata==6.7.0
ipykernel==6.16.2
ipython==7.34.0
itsdangerous==2.0.1
jedi==0.18.2
Jinja2==3.0.0
jupyter_client==8.0.0a1
jupyter_core==5.0.0rc2
keras==2.10.0
Keras-Preprocessing==1.1.2
libclang==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.21.6
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==23.1
parso==0.8.3
pickleshare==0.7.5
platformdirs==3.8.1
prompt-toolkit==3.0.39
protobuf==3.19.6
psutil==5.9.5
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.15.1
python-dateutil==2.8.2
python-dotenv==0.21.1
pyzmq==25.1.0
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
spyder-kernels==2.2.0
stack-data==0.6.2
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.31.0
tensorflow-text==2.10.0
termcolor==2.3.0
tornado==6.2
traitlets==5.9.0
typing_extensions==4.7.1
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.0.3
wrapt==1.15.0
WTForms==2.3.3
zipp==3.15.0

@@ -0,0 +1,7 @@
numpy
pandas
matplotlib
seaborn
scikit-learn
nltk
tensorflow