This project automates the detection and classification of toxic behavior in user-generated content, using machine learning models to predict multiple types of toxicity in Wikipedia comments. It compares several modeling approaches with the goal of building a robust classifier that contributes to safer digital communication spaces.
The dataset comprises Wikipedia comments labeled by human raters for six types of toxic behavior:
- Toxic
- Severe Toxic
- Obscene
- Threat
- Insult
- Identity Hate
Text preprocessing and feature extraction (a sketch of these steps follows this list):
- Removal of stop words, URLs, non-alphanumeric characters, and extra whitespace using the NLTK library and regex.
- TF-IDF vectorization and the Embed4all feature extractor for numerical representation of the text data.
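
A minimal sketch of the cleaning and vectorization steps, assuming a Jigsaw-style CSV layout (a `comment_text` column in `train.csv`, which is an assumption); the repository's notebooks may differ in detail:

```python
import re

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download("stopwords")  # run once if the corpus is missing
STOP_WORDS = set(stopwords.words("english"))

def clean_comment(text: str) -> str:
    """Strip URLs, non-alphanumeric characters, stop words, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # keep alphanumerics only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)                             # collapses extra whitespace

df = pd.read_csv("train.csv")  # file name is an assumption
df["clean"] = df["comment_text"].apply(clean_comment)

# Sparse TF-IDF representation of the cleaned comments
vectorizer = TfidfVectorizer(max_features=50_000)
X = vectorizer.fit_transform(df["clean"])

# Embed4all (from the gpt4all package) gives an alternative dense representation;
# the repo's exact usage may differ:
# from gpt4all import Embed4All
# dense = Embed4All().embed(df["clean"].iloc[0])
```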
The following models were trained and compared (a baseline training sketch follows this list):
- Multinomial Naïve Bayes Classifier
- Logistic Regression Classifier
- Linear Support Vector Classifier
- Custom Neural Network
- Pre-trained Transformer encoder (BERT)
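
For illustration, a hedged sketch of one classical baseline: a one-vs-rest logistic regression over the TF-IDF features from the sketch above. The label names follow the Jigsaw convention, and the split and hyperparameters are placeholders rather than the repository's tuned settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# X is the TF-IDF matrix from the preprocessing sketch; y holds six binary labels
y = df[LABELS].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# One independent binary classifier per toxicity label
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred = clf.predict(X_val)
print("macro F1:", f1_score(y_val, pred, average="macro"))
```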
Key findings:
- The project compared the effectiveness of the above models for multi-label toxicity classification.
- Advanced models, namely the custom neural network and the pre-trained Transformer encoder (BERT), showed promising results and outperformed the traditional classifiers.
- Class imbalance was addressed using the `LeastSampleClassSampler` technique, improving model performance (a hedged sketch of the idea follows this list).
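
`LeastSampleClassSampler` appears to be a project-specific utility, so the following is only a plausible reconstruction of the idea (undersampling every class to the size of the rarest one); the names and logic are assumptions, not the repository's code:

```python
import numpy as np

def least_sample_class_sampler(y: np.ndarray, seed: int = 42) -> np.ndarray:
    """Return row indices keeping only as many rows per class as the rarest class.

    `y` is an (n_samples, n_labels) binary indicator matrix. This is an assumed
    reconstruction, not the repository's actual implementation.
    """
    rng = np.random.default_rng(seed)
    # One group per label, plus a group for fully clean (all-zero) comments.
    groups = [np.flatnonzero(y[:, j] == 1) for j in range(y.shape[1])]
    groups.append(np.flatnonzero(y.sum(axis=1) == 0))
    n_keep = min(len(g) for g in groups)  # size of the rarest group
    keep = set()
    for rows in groups:
        keep.update(rng.choice(rows, size=n_keep, replace=False).tolist())
    return np.array(sorted(keep))

# Reusing X_train / y_train / clf from the baseline sketch above:
idx = least_sample_class_sampler(y_train)
clf.fit(X_train[idx], y_train[idx])
```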
Technologies used:
- Python
- Pandas, NumPy
- NLTK
- Scikit-learn
- TensorFlow, PyTorch
- BERT (Bidirectional Encoder Representations from Transformers)
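
Since the stack includes PyTorch and BERT, here is a minimal multi-label fine-tuning sketch using the Hugging Face `transformers` API; the tooling choice is an assumption, and the repository's notebook may wire things differently:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=6,  # one output per toxicity label
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)

texts = ["you are wonderful", "example toxic comment"]  # placeholder inputs
labels = torch.zeros((2, 6))                            # placeholder float targets
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

outputs = model(**batch, labels=labels)
outputs.loss.backward()                # gradients for one training step
probs = torch.sigmoid(outputs.logits)  # per-label toxicity probabilities
```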
Clone this repository and install the required libraries:

```bash
git clone https://github.com/Mustafa-Ashfaq81/Toxicity-Classification.git
cd Toxicity-Classification
pip install -r requirements.txt  # assuming dependencies are listed in requirements.txt
```
Refer to the Jupyter notebooks and Python scripts in the repository to run the models and analyze the results.