This project evaluates several supervised learning algorithms and selects the one best suited to modeling individuals' income, using data collected from the 1994 U.S. Census. The goal is to accurately predict whether an individual makes more than $50,000. Understanding an individual's income can help a non-profit decide how large a donation to request.
Based on accuracy, F-score, and training time, the best model is the Random Forest Classifier (RFC). Since this is a classification problem, a random forest is a good fit: it trains quickly and its results are easy to communicate to stakeholders.
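As a rough illustration of how the three selection criteria (accuracy, F-score, training time) can be measured for one candidate model, here is a minimal sketch using scikit-learn. The `evaluate` helper, the synthetic data from `make_classification`, and the choice of `beta=0.5` for the F-score are assumptions made for this example, not details taken from the project itself.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test, beta=0.5):
    """Fit `model` and return its training time, accuracy, and F-beta score.

    `evaluate` is a hypothetical helper; `beta=0.5` is an assumed setting.
    """
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    predictions = model.predict(X_test)
    return {
        "train_time": train_time,
        "accuracy": accuracy_score(y_test, predictions),
        "f_score": fbeta_score(y_test, predictions, beta=beta),
    }


# Synthetic stand-in data; the real project would use the census features.
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = evaluate(
    RandomForestClassifier(random_state=0), X_train, y_train, X_test, y_test
)
print(results)
```

Running the same helper over several classifiers and comparing the returned dictionaries is one simple way to rank candidates on all three criteria at once.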
- Preprocessing
- Transforming Skewed Continuous Features
- Normalizing Numerical Features
- One-hot Encoding
- Implement performance metrics to evaluate the potential algorithms
- Choosing the Best Model & Model Tuning
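The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny sample frame, its column names (`age`, `capital_gain`, `workclass`), and the use of a log transform and `MinMaxScaler` are choices made for the example rather than details confirmed by the project.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up sample; column names are illustrative, not the real schema.
data = pd.DataFrame({
    "age": [25.0, 38.0, 52.0, 46.0],
    "capital_gain": [0.0, 0.0, 15000.0, 3000.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})

skewed = ["capital_gain"]            # assumed to be heavily skewed
numerical = ["age", "capital_gain"]  # continuous columns to normalize
categorical = ["workclass"]          # non-numeric columns to encode

features = data.copy()

# 1. Transform skewed continuous features: log(x + 1) keeps zeros defined.
features[skewed] = np.log1p(features[skewed])

# 2. Normalize numerical features to the [0, 1] range.
scaler = MinMaxScaler()
features[numerical] = scaler.fit_transform(features[numerical])

# 3. One-hot encode categorical features into indicator columns.
features = pd.get_dummies(features, columns=categorical)

print(features.columns.tolist())
```

Applying the log transform before scaling matters here: scaling first would compress the skewed values into a narrow band near zero, and the later transform would do little to spread them out.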
This project requires Python 3.x and the following Python libraries installed:
You will also need software installed to run and execute an iPython Notebook.
The modified census dataset consists of approximately 32,000 data points, with each data point having 13 features. This dataset is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", by Ron Kohavi. The paper is available online, and the original dataset is hosted on the UCI Machine Learning Repository.