Book Popularity Prediction

This machine learning project predicts the popularity of books from features such as the review summary, review text, review helpfulness, price, and categories. The goal is to classify each book as "Popular" or "Unpopular" using a Random Forest model.

Dataset

The dataset used in this project is books.csv, which contains the following columns:

review/summary: A brief summary of the book review.
review/text: The full text of the book review.
review/helpfulness: A string of the form 'good_reviews/total_reviews', giving the number of helpful reviews out of the total number of reviews.
price: The price of the book.
popularity: The popularity of the book, classified as 'Popular' or 'Unpopular'.
authors: The author(s) of the book.
categories: The categories the book belongs to.
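
As a quick sanity check before preprocessing, the column list above can be compared against a loaded frame. This is a minimal sketch: the helper name check_columns and the two sample rows are hypothetical and only stand in for the contents of books.csv.

```python
import pandas as pd

# Columns listed in the dataset description above.
EXPECTED_COLUMNS = [
    "review/summary",
    "review/text",
    "review/helpfulness",
    "price",
    "popularity",
    "authors",
    "categories",
]


def check_columns(df):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Hypothetical rows standing in for books.csv; the values are invented.
sample = pd.DataFrame(
    {
        "review/summary": ["Great read", "Not for me"],
        "review/text": ["A great story.", "Slow plot."],
        "review/helpfulness": ["7/10", "0/0"],
        "price": [12.5, 8.0],
        "popularity": ["Popular", "Unpopular"],
        "authors": ["Author A", "Author B"],
        "categories": ["Fiction", "History"],
    }
)

missing = check_columns(sample)
print("Missing columns:", missing)
```

With the real file, the same check would presumably run on the result of pd.read_csv("books.csv"), adjusting the path to wherever the file is stored.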

Project Overview

1. Data Preprocessing

Feature Extraction: The features extracted from the dataset include text data (review/summary, review/text), numerical data (price and the two counts parsed from review/helpfulness), and categorical data (authors, categories).
Splitting the 'review/helpfulness' Column: The review/helpfulness column is split into two new columns, good_reviews and total_reviews, representing the number of helpful reviews and total reviews, respectively.
Price Scaling: The price column is standardized with StandardScaler, which rescales it to zero mean and unit variance.
Popularity Classification: The popularity column is converted into a binary classification target (1 for popular, 0 for unpopular).
Text Processing: TfidfVectorizer is used to convert review/summary and review/text into numerical features.
Categorical Data Encoding: Authors are encoded using LabelEncoder, and categories are transformed into dummy variables.
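
The non-text preprocessing steps above could look roughly like the following. This is a sketch under assumptions, not the notebook's actual code: the sample rows and the derived column names price_scaled, target, and author_id are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical rows standing in for books.csv; the values are invented.
df = pd.DataFrame(
    {
        "review/helpfulness": ["7/10", "0/0", "3/4"],
        "price": [10.0, 20.0, 30.0],
        "popularity": ["Popular", "Unpopular", "Popular"],
        "authors": ["Author A", "Author B", "Author A"],
    }
)

# Split 'good_reviews/total_reviews' into two integer columns.
parts = df["review/helpfulness"].str.split("/", expand=True)
df["good_reviews"] = parts[0].astype(int)
df["total_reviews"] = parts[1].astype(int)

# Standardize price to zero mean and unit variance.
df["price_scaled"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Binary target: 1 for 'Popular', 0 for 'Unpopular'.
df["target"] = (df["popularity"] == "Popular").astype(int)

# Integer codes for authors, assigned in sorted order of the author strings.
df["author_id"] = LabelEncoder().fit_transform(df["authors"])

print(df[["good_reviews", "total_reviews", "price_scaled", "target", "author_id"]])
```

Note that LabelEncoder assigns arbitrary integer codes, so the resulting author_id carries no meaningful ordering. In a real pipeline the scaler would normally be fitted on the training split only, to avoid leaking test-set statistics.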

2. Feature Engineering

Numerical Features: The numerical features include scaled price, good_reviews, and total_reviews.
Text Features: The text features are extracted using TF-IDF vectorization on the review/summary and review/text.
Categorical Features: One-hot encoding is used to create dummy variables for the categories.
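
One way the three feature groups might be built and combined into a single matrix is sketched below. The sample rows are hypothetical, each book is assumed to have a single category string, and stacking the blocks with scipy.sparse.hstack is an assumption about how the features are joined rather than a description of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows; price_scaled, good_reviews and total_reviews are assumed
# to come from an earlier preprocessing step.
df = pd.DataFrame(
    {
        "review/summary": ["Great read", "Not for me", "Loved it"],
        "review/text": [
            "A great story with rich characters.",
            "Slow plot and flat characters.",
            "I loved every chapter of this story.",
        ],
        "categories": ["Fiction", "History", "Fiction"],
        "price_scaled": [-1.2, 0.0, 1.2],
        "good_reviews": [7, 0, 3],
        "total_reviews": [10, 0, 4],
    }
)

# Text features: a separate TF-IDF vocabulary for each text column.
summary_tfidf = TfidfVectorizer().fit_transform(df["review/summary"])
text_tfidf = TfidfVectorizer().fit_transform(df["review/text"])

# Categorical features: one dummy column per category value.
category_dummies = pd.get_dummies(df["categories"], dtype=float)

# Numerical features as a dense float array.
numeric = df[["price_scaled", "good_reviews", "total_reviews"]].to_numpy(dtype=float)

# Stack all blocks column-wise into one sparse feature matrix.
X = hstack(
    [
        csr_matrix(numeric),
        summary_tfidf,
        text_tfidf,
        csr_matrix(category_dummies.to_numpy()),
    ]
).tocsr()

print("Feature matrix shape:", X.shape)
```

Keeping the matrix sparse avoids materializing the full TF-IDF vocabulary as dense columns, which matters once the review text grows beyond a toy sample.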

3. Model Training and Evaluation

Random Forest Classifier: A Random Forest model is trained on the combined numerical, text-based, and categorical features. The model is evaluated by its accuracy on a held-out test set.
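
The training and evaluation step might be sketched as follows. The feature matrix here is synthetic random data standing in for the combined book features, and the 80/20 split, n_estimators=100, and the random seeds are illustrative assumptions rather than the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined feature matrix and the binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the rows for testing, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the forest on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Accuracy is the fraction of test rows whose predicted label matches.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

RandomForestClassifier also accepts SciPy sparse matrices, so a sparse matrix of TF-IDF and dummy features could be passed to fit in place of the dense array used here.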

Akram-Toumi/Book-Popularity-Prediction

Folders and files

Latest commit

History

Repository files navigation

Book Popularity Prediction

Dataset

Project Overview

1. Data Preprocessing

2. Feature Engineering

3. Model Training and Evaluation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages