
MACHINE LEARNING / NLP / AMAZON SAGEMAKER: Plagiarism detector in 3 ML model versions (XGBoost, Random Forest & my implementation) deployed on Amazon SageMaker.


karolcichosz/project-plagiarism-detection


Plagiarism Detection with 3 Machine Learning models

This repository contains code and associated files for deploying a plagiarism detector using Amazon SageMaker, Sklearn and PyTorch.

Project Overview

The goal of this project is to build a plagiarism detector that examines a text file and performs binary classification, labeling that file as either plagiarized or not depending on how similar it is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial, and the differences between paraphrased answers and original work are often subtle.

This project will be broken down into three main notebooks:

Notebook 1: Data Exploration

  • Load in the corpus of plagiarism text data.
  • Explore the existing data features and the data distribution.

Notebook 2: Feature Engineering

  • Clean and pre-process the text data.
  • Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
  • Select "good" features, by analyzing the correlations between different features.
  • Create train/test files that hold the relevant features and class labels for train/test data points.
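
As an illustration of the kind of similarity feature described above, here is a minimal sketch of n-gram containment: the fraction of an answer's word n-grams that also appear in the source text. This README does not name the features the notebook actually uses, so the metric, the function names `ngram_counts` and `containment`, and the sample sentences are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in lowercased text, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer_text, source_text, n=1):
    """Fraction of the answer's n-grams that also occur in the source (0.0 to 1.0)."""
    answer = ngram_counts(answer_text, n)
    source = ngram_counts(source_text, n)
    total = sum(answer.values())
    if total == 0:
        return 0.0  # the answer has fewer than n words
    shared = sum((answer & source).values())  # multiset intersection of n-grams
    return shared / total


# Hypothetical sample texts, not taken from the project's corpus.
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over a sleeping dog"
unrelated = "machine learning models need labeled data"

print(containment(copied, source, n=1))     # 7 of 9 words are shared
print(containment(copied, source, n=3))     # 4 of 7 trigrams are shared
print(containment(unrelated, source, n=1))  # no overlap
```

Computing the same score for several values of `n` gives a small feature vector per answer; longer n-grams tend to reward verbatim copying, while unigrams also pick up loosely paraphrased text.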

Notebook 3: Train and Deploy Your Model in SageMaker

  • Upload train/test feature data to S3.
  • Define a binary classification model and a training script in 3 versions:
    • XGBoost built-in Amazon SageMaker implementation - 92% accuracy,
    • RandomForest Sklearn implementation - 96% accuracy,
    • my own fully connected neural network implemented in PyTorch - 100% accuracy.
  • Train your model and deploy it using SageMaker.
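
Before the upload step, the feature files need a layout the training job can read. SageMaker's built-in XGBoost algorithm expects CSV training data with no header row and the class label in the first column. The sketch below writes a file in that shape using only the standard library. The helper name `write_sagemaker_csv`, the file name `train.csv`, and the feature values are hypothetical; the notebooks may produce their files differently.

```python
import csv
import os
import tempfile


def write_sagemaker_csv(rows, path):
    """Write (label, features) rows as header-less CSV with the label first."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for label, features in rows:
            writer.writerow([label, *features])


# Hypothetical feature values: 1 = plagiarized, 0 = not plagiarized.
train_rows = [
    (1, [0.93, 0.81]),
    (0, [0.12, 0.05]),
]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
write_sagemaker_csv(train_rows, train_path)

# Read the file back to confirm the layout.
with open(train_path) as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting file can then be uploaded to S3 and referenced as the training channel for the estimator.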

Installation

You need Amazon SageMaker and a notebook instance to run this code. When creating the notebook instance, you can add a link to this repository so that the project is cloned into the instance. Detailed step-by-step instructions are provided in Jupyter notebooks 1, 2 and 3.

Licence

Copyright: Karol Cichosz. The content of this repository is licensed under the MIT licence.
