| title | author | abstract |
|---|---|---|
| Timing Comparisons for Supervised Learning of a Classifier | | |
Based on this work: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
- /data <-- data
- /doc <-- write-up, ipynb, latex
- /lib <-- code
- /results <-- output
For portability and reproducibility of results, we have elected to use the Docker system and its Dockerfile
syntax to prepare the analysis environment. As this work is done in Python with the scikit-learn
library, we use an environment built via the Anaconda package manager. Furthermore, leveraging images designed by and for the Jupyter project, which are built via Anaconda, allows a single container to be used both for running the analysis script and for interactive analysis of the data via Jupyter. We use a Docker image designed and maintained by the Jupyter team.
We have designed a Makefile to make working with the Docker system easier.
Via the Makefile, data can be preprocessed,
$ make wrangle_data
all classifiers can be fit,
$ make all_classifiers
or an interactive notebook server can be launched
$ make notebook_server
Note that the last target relies on the launch script inherited from the original Jupyter notebook image, in that no explicit command is passed to the container.
Select a dataset

Proposed requirements:
- Large, but not so large that it cannot fit on a single system running Docker
- lends itself to binary classification
- features of many different types
- from UCI Machine Learning Dataset Library
Proposed dataset: https://archive.ics.uci.edu/ml/datasets/Adult

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
The prediction task is to determine whether a person makes over $50K a year.
Attribute Information:

Listing of attributes:

- class label: >50K, <=50K.
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Relevant Papers:
Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996
One-hot encode the categorical features and convert all booleans to numeric values, then split the data into:
- a training set
- a test set
- using a fixed random seed for reproducibility (see the sketch below)
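A minimal sketch of this preprocessing step, assuming the raw UCI adult.data file is stored under /data and using pandas with scikit-learn; the column names follow the UCI attribute listing above, while the file path and split ratio are illustrative assumptions, not taken from the project code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names follow the UCI attribute listing; the last column is the income label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# The path is an assumption about how the raw file is stored under /data.
df = pd.read_csv("data/adult.data", names=COLUMNS, skipinitialspace=True)

# Binary target: 1 if income exceeds $50K/yr, else 0.
y = (df["income"].str.strip() == ">50K").astype(int)

# One-hot encode categorical columns; continuous columns pass through unchanged.
X = pd.get_dummies(df.drop(columns=["income"]))

# Fixed seed so the train/test split is reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```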
PUT YOUR NAME NEXT TO ONE YOU WOULD LIKE TO IMPLEMENT
- calibration.CalibratedClassifierCV (JOSHUA)
- discriminant_analysis.LinearDiscriminantAnalysis (JOSHUA)
- discriminant_analysis.QuadraticDiscriminantAnalysis (JOSHUA)
- dummy.DummyClassifier (JOSHUA)
- ensemble.AdaBoostClassifier (DAVID)
- ensemble.BaggingClassifier (BHARAT)
- ensemble.ExtraTreesClassifier
- ensemble.GradientBoostingClassifier (NASH)
- ensemble.RandomForestClassifier (MATT)
- ensemble.RandomTreesEmbedding
- ensemble.VotingClassifier (BHARAT)
- linear_model.LogisticRegression (JOSHUA)
- linear_model.PassiveAggressiveClassifier
- linear_model.RidgeClassifier (BHARAT)
- linear_model.SGDClassifier (JOSHUA)
- multiclass.OneVsOneClassifier
- multiclass.OneVsRestClassifier
- multiclass.OutputCodeClassifier
- naive_bayes.BernoulliNB (ANDREY)
- naive_bayes.GaussianNB (ANDREY)
- naive_bayes.MultinomialNB
- neighbors.KNeighborsClassifier (MATT)
- neighbors.NearestCentroid
- neighbors.RadiusNeighborsClassifier
- neural_network.BernoulliRBM (MAXIME)
- semi_supervised.LabelPropagation
- svm.LinearSVC
- svm.NuSVC (ANDREY)
- svm.SVC (MATT)
- tree.DecisionTreeClassifier (MATT)
- tree.ExtraTreeClassifier
TODO: General description.
Ensemble methods are not learning algorithms themselves, in the sense of mapping features directly to an output. Rather, an ensemble technique is a meta-algorithm that combines many learners into a single learner. The base learners (those being combined) are typically constructed to be either high bias (as in boosting) or high variance (as in bagging). When combined, whether additively, by voting, or otherwise, these base learners produce one strong, regularized model. There are countless ensemble meta-algorithms; what follows is an analysis of the most common ones. A sketch of a simple voting ensemble appears below.
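As an illustration of the "combine by voting" idea, a minimal sketch using scikit-learn's VotingClassifier over the X_train/y_train produced by the preprocessing sketch above; the particular base estimators chosen here are assumptions for illustration only:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Three base learners combined by majority ("hard") voting into a single classifier.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ],
    voting="hard",
)

ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```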
Adaptive Boosting uses a large number of "weak" learners to make predictions with high accuracy. Each learner is weighted, and their collective output is used to make a classification (a sketch follows this list).
- It does not require high-accuracy base classifiers
- It is more complicated than a single classifier
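A minimal sketch of fitting the listed ensemble.AdaBoostClassifier on the preprocessed data; the number of estimators is an illustrative assumption rather than a tuned value (by default scikit-learn uses depth-1 decision trees as the weak learners):

```python
from sklearn.ensemble import AdaBoostClassifier

# 100 weak learners, each weighted by the boosting algorithm.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```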
In any boosting algorithm, the next learner focuses on the shortcomings of the existing model. In gradient boosting, those shortcomings are identified by the gradient of the cost function (a sketch follows this list).
- as with most ensemble methods, gradient boosting tends to do better than individual trees; intuitively, it takes the best that each tree has to offer and adds it all up
- with the ability to generalize to any differentiable cost function, gradient boosting has the potential to be robust to outliers; this and similar properties can be obtained by selecting an appropriate cost function
- up to a point, the strength of the model is proportional to its computational cost; the more trees added, the more expensive the training process
- overfitting is quite easy and effective regularization is necessary; this is controlled by the hyperparameters, most importantly n_estimators, the number of trees
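A minimal sketch of the listed ensemble.GradientBoostingClassifier on the preprocessed data; the hyperparameter values shown are illustrative assumptions, not tuned choices:

```python
from sklearn.ensemble import GradientBoostingClassifier

# n_estimators controls the number of trees (and therefore training cost and overfitting risk);
# learning_rate shrinks each tree's contribution as a form of regularization.
gbt = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))
```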
TODO: General description.
TODO: General description.
TODO: General description.
TODO: General description.
TODO: General description.
TODO: General description.
TODO: General description.
List of Supervised Learning Models here.
What metrics should be used for timing, for accuracy, and for other criteria? A timing sketch follows the list below.
- raw fit time of the classifier
- raw prediction time of the classifier
- GridSearchCV fit time
- prediction time with the tuned model
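A minimal sketch of how these four measurements might be collected for a single classifier, using time.perf_counter and GridSearchCV; the example classifier and its parameter grid are assumptions for illustration only:

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(random_state=42)

# 1. Raw fit time of the classifier.
start = time.perf_counter()
clf.fit(X_train, y_train)
fit_time = time.perf_counter() - start

# 2. Raw prediction time of the classifier.
start = time.perf_counter()
clf.predict(X_test)
predict_time = time.perf_counter() - start

# 3. GridSearchCV fit time (includes the cross-validated search over the grid).
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
start = time.perf_counter()
grid.fit(X_train, y_train)
grid_fit_time = time.perf_counter() - start

# 4. Prediction time with the tuned model.
start = time.perf_counter()
grid.best_estimator_.predict(X_test)
tuned_predict_time = time.perf_counter() - start

print(fit_time, predict_time, grid_fit_time, tuned_predict_time)
```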
Highest performing model

What this says about the data set chosen