---
title: Timing Comparisons for Supervised Learning of a Classifier
author:
  - Author One, Udacity, Machine Learning Nanodegree
  - Author Two, Udacity, Machine Learning Nanodegree
  - Author Three, Udacity, Machine Learning Nanodegree
  - Author Four, Udacity, Machine Learning Nanodegree
abstract: >
  This paper describes $n$ methods for fitting a binary supervised learning
  classifier on a single large dataset with features of multiple types. Timing
  and accuracy metrics are presented for each method, with analysis of the
  results in terms of the structure of the data set. Fits were performed using
  the popular open-source machine learning library `scikit-learn`. Additionally,
  a code repository including all necessary infrastructure has been developed
  and shared for reproducibility of results.
---

Project Structure

Based on this work: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

/data     <-- data
/doc      <-- write-up, notebooks, LaTeX
/lib      <-- code
/results  <-- output

System Design

For portability and reproducibility of results, we have elected to use Docker and its Dockerfile syntax to prepare the runtime environment. As this work is done using Python and the scikit-learn library, we have elected to build the system via the Anaconda package manager. Furthermore, leveraging images designed by and for the Jupyter project, which are built via Anaconda, allows a single container to be used both for running the analysis scripts and for interactive analysis of the data via Jupyter. We use a Docker image designed and maintained by the Jupyter team.

Running commands

We have designed a Makefile to make working with the Docker system easier.

Via the Makefile, data can be preprocessed,

$ make wrangle_data

all classifiers can be fit,

$ make all_classifiers

or an interactive notebook server can be launched:

$ make notebook_server

Note that the last command relies on a built-in launch script inherited from the upstream Jupyter notebook image, in that no explicit command is passed to the container.

Data Set

Select a dataset. Proposed requirements:

  • Large but not too large, i.e. able to fit on a single system running Docker
  • Lends itself to binary classification
  • Contains many different types of features
  • From the UCI Machine Learning Repository

Dataset

https://archive.ics.uci.edu/ml/datasets/Adult (proposed)

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

Attribute Information:

Listing of attributes:

>50K, <=50K.

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Relevant Papers:

Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996

Data Visualization

Feature Engineering

  • One-hot encode categorical features
  • Convert all booleans to numeric values
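
A minimal sketch of these two steps using pandas, assuming the raw Adult CSV sits at a hypothetical path data/adult.data and that the label column is named income (the actual loading code lives under /lib and may differ):

```python
import pandas as pd

# Column names from the attribute listing above; "income" is an assumed
# name for the unlabeled target column in the raw file.
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country",
        "income"]
df = pd.read_csv("data/adult.data", names=cols, skipinitialspace=True)

# One-hot encode the categorical features.
categorical = ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "sex", "native-country"]
df = pd.get_dummies(df, columns=categorical)

# Convert any boolean columns to 0/1.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)

X = df.drop("income", axis=1)
y = (df["income"] == ">50K").astype(int)   # binary target
```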

Split Data Set

  • training
  • test
  • use seed for reproducibility
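
A minimal sketch using scikit-learn's train_test_split, taking the X and y built in the feature-engineering step above; the 80/20 split and the seed value are assumptions, not settled choices:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # assumed 80/20 split
    random_state=42,    # fixed seed for reproducibility
    stratify=y,         # preserve the class balance in both sets
)
```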

Models

PUT YOUR NAME NEXT TO ONE YOU WOULD LIKE TO IMPLEMENT

- calibration.CalibratedClassifierCV                   (JOSHUA)
- discriminant_analysis.LinearDiscriminantAnalysis     (JOSHUA)
- discriminant_analysis.QuadraticDiscriminantAnalysis  (JOSHUA)
- dummy.DummyClassifier                                (JOSHUA)
- ensemble.AdaBoostClassifier                          (DAVID)
- ensemble.BaggingClassifier                           (BHARAT)
- ensemble.ExtraTreesClassifier
- ensemble.GradientBoostingClassifier                  (NASH)
- ensemble.RandomForestClassifier                      (MATT)
- ensemble.RandomTreesEmbedding
- ensemble.VotingClassifier                            (BHARAT)
- linear_model.LogisticRegression                      (JOSHUA)
- linear_model.PassiveAggressiveClassifier
- linear_model.RidgeClassifier                         (BHARAT)
- linear_model.SGDClassifier                           (JOSHUA)
- multiclass.OneVsOneClassifier
- multiclass.OneVsRestClassifier
- multiclass.OutputCodeClassifier
- naive_bayes.BernoulliNB                              (ANDREY)
- naive_bayes.GaussianNB                               (ANDREY)
- naive_bayes.MultinomialNB
- neighbors.KNeighborsClassifier                       (MATT)
- neighbors.NearestCentroid
- neighbors.RadiusNeighborsClassifier
- neural_network.BernoulliRBM                          (MAXIME)
- semi_supervised.LabelPropagation
- svm.LinearSVC
- svm.NuSVC                                            (ANDREY)
- svm.SVC                                              (MATT)
- tree.DecisionTreeClassifier                          (MATT)
- tree.ExtraTreeClassifier

Discriminant Analysis

TODO: General description.

Linear discriminant analysis

Brief description

Strengths

Weaknesses

Quadratic discriminant analysis

Brief description

Strengths

Weaknesses

Ensemble

Ensemble methods are not learning algorithms in their own right, in the sense of mapping features directly to an output. Rather, an ensemble technique is a meta-algorithm that combines many learners into a single learner. These base learners (those being combined) are typically constructed to be either high bias (as in boosting) or high variance (as in bagging). When combined, whether additively, by voting, or otherwise, these base learners produce one strong, regularized model. There are countless ensemble meta-algorithms; what follows is an analysis of the most common ensemble methods.
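
As a concrete illustration of the voting flavor of this idea, here is a minimal sketch using scikit-learn's VotingClassifier; the choice of base estimators is illustrative only, and X_train/y_train are assumed from the split step above:

```python
# Combine heterogeneous base learners by majority vote (illustration only).
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(max_depth=5))],
    voting="hard",   # "hard" = majority vote over predicted class labels
)
voter.fit(X_train, y_train)
print(voter.score(X_test, y_test))
```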

AdaBoost

Brief description

Adaptive Boosting (AdaBoost) combines a large number of "weak" learners to make predictions with high accuracy. Each weak learner is assigned a weight, and the weighted collective output is used to make the classification.
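
A minimal sketch, assuming the X_train/y_train split above; scikit-learn's default weak learner for AdaBoostClassifier is a depth-1 decision stump:

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=200, random_state=42)
ada.fit(X_train, y_train)

# Each fitted weak learner carries a weight; the final classification is
# the weighted vote of all 200 stumps.
print(ada.estimator_weights_[:5])
print(ada.score(X_test, y_test))
```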

Strengths

  • It doesn't require high-accuracy classifiers

Weaknesses

  • More complicated than a single classifier

Gradient Boosting

Brief description

In any boosting algorithm, the shortcomings of the existing model are what the next learner focuses on. In gradient boosting, those shortcomings are identified by the negative gradient of the cost function, $r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}$ for $i = 1, \ldots, n$. Once these gradients (often called "residuals", though gradients is the more general term) are computed, the next learner is fit to a new dataset constructed from them, $D = \{(x_i, r_i)\}_{i=1}^{n}$. The model is then updated by adding this new weak learner, and the process begins again. Gradient boosting is most commonly used with decision trees as the base learners, and it can be shown to be a generalization of AdaBoost, one that is able to handle any differentiable cost function.
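
To make the residual-fitting loop concrete, here is a minimal hand-rolled sketch for the squared-error cost, where the negative gradient reduces to the ordinary residual $y_i - F(x_i)$; it uses synthetic regression data purely for illustration, whereas the project itself uses ensemble.GradientBoostingClassifier:

```python
# Hand-rolled gradient boosting for squared loss (illustration only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
F = np.full_like(y, y.mean())   # F_0: constant initial model
trees = []

for m in range(100):
    residuals = y - F                          # negative gradient of (1/2)(y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    F += learning_rate * tree.predict(X)       # F_m = F_{m-1} + lr * h_m
```

In GradientBoostingClassifier the same loop runs with a classification cost function, and n_estimators plays the role of the number of iterations.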

Strengths

  • As with most ensemble methods, gradient boosting tends to do better than individual trees because, intuitively, it takes the best that each tree has to offer and adds it all up
  • Because it generalizes to any differentiable cost function, gradient boosting can be made robust to outliers; this and similar properties are obtained by selecting an appropriate cost function

Weaknesses

  • Up to a point, the strength of the model is proportional to its computational cost: the more trees added, the more expensive the training process
  • Overfitting is easy, so effective regularization is necessary; this is controlled through the hyper-parameters, most importantly n_estimators, the number of trees

Random Forest

Brief description

Strengths

Weaknesses

Bagging Classifier

Brief description

Strengths

Weaknesses

Extra Trees Classifier

Brief description

Strengths

Weaknesses

Random Trees Embedding

Brief description

Strengths

Weaknesses

Voting Classifier

Brief description

Strengths

Weaknesses

Linear Model

TODO: General description.

Logistic Regression

Brief description

Strengths

Weaknesses

Ridge Classifier

Brief description

Strengths

Weaknesses

SGD Classifier

Brief description

Strengths

Weaknesses

Passive Aggressive Classifier

Brief description

Strengths

Weaknesses

Multiclass

TODO: General description.

One-vs-One Classifier

Brief description

Strengths

Weaknesses

One-vs-Rest Classifier

Brief description

Strengths

Weaknesses

Output Code Classifier

Brief description

Strengths

Weaknesses

Naive Bayes

TODO: General description.

Gaussian NB

Brief description

Strengths

Weaknesses

Bernoulli NB

Brief description

Strengths

Weaknesses

Multinomial NB

Brief description

Strengths

Weaknesses

Neighbors

TODO: General description.

K Neighbors Classifier

Brief description

Strengths

Weaknesses

Nearest Centroid

Brief description

Strengths

Weaknesses

Radius Neighbors Classifier

Brief description

Strengths

Weaknesses

SVM

TODO: General description.

Support Vector Classifier

Brief description

Strengths

Weaknesses

Linear Support Vector Classifier

Brief description

Strengths

Weaknesses

Nu Support Vector Classifier

Brief description

Strengths

Weaknesses

Tree

TODO: General description.

Decision Tree Classifier

Brief description

Strengths

Weaknesses

Extra Tree Classifier

Brief description

Strengths

Weaknesses

Misc

TODO: General description.

Calibrated Classifier CV

Brief description

Strengths

Weaknesses

Dummy Classifier

Brief description

Strengths

Weaknesses

Bernoulli Restricted Boltzmann Machine

Brief description

Strengths

Weaknesses

Label Propagation

Brief description

Strengths

Weaknesses

List of Supervised Learning Models here.

Metrics

What metrics should be used for timing, for accuracy, and for other properties of interest?
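
One possible starting point, stated here as an assumption rather than the project's settled choice, is to record wall-clock fit and prediction times alongside standard classification scores; clf stands in for any model from the list above, and the train/test split is assumed from earlier:

```python
import time

from sklearn.metrics import accuracy_score, f1_score

start = time.perf_counter()
clf.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start          # timing metric: fit

start = time.perf_counter()
predictions = clf.predict(X_test)
predict_seconds = time.perf_counter() - start      # timing metric: predict

accuracy = accuracy_score(y_test, predictions)     # accuracy metric
f1 = f1_score(y_test, predictions)                 # sensitive to class imbalance
```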

Pipeline

  1. Raw fit of the classifier
  2. Raw prediction of the classifier
  3. GridSearchCV fit
  4. Prediction on the tuned model
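
A minimal sketch of these four stages, assuming the train/test split produced earlier and using RandomForestClassifier as a stand-in for any model in the list above (the parameter grid is illustrative only):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train, y_train)                      # 1. raw fit of classifier
raw_predictions = clf.predict(X_test)          # 2. raw prediction of classifier

grid = GridSearchCV(
    clf,
    param_grid={"n_estimators": [50, 100, 200],   # illustrative grid
                "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)                     # 3. GridSearchCV fit
tuned_predictions = grid.best_estimator_.predict(X_test)   # 4. prediction on tuned model
```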

Analysis

  • Highest performing model
  • What this says about the data set chosen