Download, explore, and wrangle the Titanic passenger manifest dataset with an eye toward developing a predictive model for survival.
This tutorial is based on the Kaggle competition "Predicting Survival Aboard the Titanic".
Licensed under CC BY-SA 3.0 via Wikimedia Commons: "Cd51-1000g" by Boris Lux
Start by cloning this repository.
Anaconda users: you should have everything you need, but if you find you are missing anything, type this into the command line:
conda install -c https://conda.anaconda.org/blaze <package>
Others: make sure the required libraries are installed by using:
pip install -r requirements.txt
Then look inside the data folder and open train.csv to check out the dataset we'll be exploring today.
To start the lab, open up the IPython Notebook file: titanic_wrangling.ipynb.
- How to explore a new dataset?
- What to look for in tabular data?
- What visualization tools can you use to help you explore?
- What is the end goal of data wrangling? Why are we even doing this?
- What to clean and how to clean it?
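As one possible starting point for the questions above, here is a minimal pandas sketch of a first pass over a Titanic-style table. The rows are made up for illustration (they are not real passenger records) and are read from an in-memory string so the snippet runs on its own; in the lab you would pass the path to train.csv in the data folder instead. The helper name `load_passengers` and the median/mode fill strategy are assumptions, not necessarily what the notebook does.

```python
import io

import pandas as pd

# Made-up rows shaped like the Kaggle Titanic train.csv (illustration only).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked
1,0,3,male,22,7.25,S
2,1,1,female,38,71.28,C
3,1,3,female,,7.92,S
4,0,2,male,35,13.0,
5,0,3,male,,8.05,Q
"""


def load_passengers(source):
    """Hypothetical helper: read a CSV path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_passengers(io.StringIO(SAMPLE_CSV))

# What to look for first: a few rows, the column types, and the gaps.
print(df.head())
print(df.dtypes)
missing = df.isnull().sum()
print(missing)

# A quick grouped summary often suggests which columns matter for survival.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# One simple cleaning choice (an assumption): fill numeric gaps with the
# median and categorical gaps with the most common value.
cleaned = df.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
```

Filling with the median or mode keeps every row, at the cost of flattening some variation; dropping incomplete rows is the other common option when few values are missing.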
See also:
Baby steps to performing exploratory analysis in Python
Data munging using Pandas
(You will do this portion in the Machine Learning course.)
The IPython Notebook for this class is called "titanicML_workshop.ipynb". To get it, navigate on the command line to the titanic repository that you cloned for the last class, set aside any uncommitted local changes with git stash, and pull the latest version:
git stash
git pull origin master
If you haven't already installed Scikit-learn, do that now.
Anaconda users: you already have Scikit-learn! If you ever find you are missing anything, type this into the command line:
conda install -c https://conda.anaconda.org/blaze <package>
Everyone else, make sure Scikit-learn is installed:
WINDOWS USERS:
pip install -U scikit-learn
MAC OSX USERS:
pip install -U numpy scipy scikit-learn
LINUX w/ Python 2:
sudo apt-get install build-essential python-dev python-setuptools \
python-numpy python-scipy \
libatlas-dev libatlas3gf-base
sudo apt-get install python-matplotlib
LINUX w/ Python 3:
sudo apt-get install build-essential python3-dev python3-setuptools \
python3-numpy python3-scipy \
libatlas-dev libatlas3gf-base
sudo apt-get install python3-matplotlib
Problems with installation? Check out: http://scikit-learn.org/stable/install.html
If you get hung up with the installation or the repo update, you can also get the gist: https://gist.github.com/rebeccabilbro/d40599f4ec96aa21dc48
Machine Learning
Classification
Cross-Validation
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
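A minimal sketch of holding out a test set with `train_test_split`. Note that the link above refers to the older `sklearn.cross_validation` module; in recent scikit-learn releases the same function is imported from `sklearn.model_selection`, which is what this sketch assumes. The feature matrix and labels are random placeholders, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples, 3 features, balanced binary labels.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing. stratify=y keeps the class balance
# similar in both splits; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The model is fit only on the training split, so the score on the held-out split gives a less optimistic estimate of how it will do on unseen passengers.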
Model Evaluation
- Scores
- Classification reports
- Visualization tools
- Precision and recall
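To make these evaluation tools concrete, here is a small sketch using `sklearn.metrics` on hand-written labels. The `y_true` and `y_pred` lists are invented for illustration (1 = survived, 0 = died) and do not come from any fitted model.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Invented labels for illustration: 1 = survived, 0 = died.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# Precision: of the passengers predicted to survive, how many actually did.
precision = precision_score(y_true, y_pred)
# Recall: of the passengers who actually survived, how many were found.
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)

# The classification report shows precision, recall, and F1 per class.
print(classification_report(y_true, y_pred, target_names=["died", "survived"]))
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4.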
Linear Regression
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Random Forests
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
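A rough sketch of fitting a `RandomForestClassifier`, assuming a recent scikit-learn. The data is synthetic: the label is defined to depend only on the first feature, so the fitted model's feature importances have something clear to pick up. The hyperparameters are illustrative, not tuned values from the workshop notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the first of three features.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative settings: 100 trees, fixed seed for repeatable results.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# score() returns mean accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
# feature_importances_ sums to 1; larger means the feature was more useful.
importances = model.feature_importances_

print(accuracy)
print(importances)
```

On the Titanic data, the same `feature_importances_` attribute can hint at which cleaned columns the forest relies on most.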
Support Vector Machines
http://scikit-learn.org/stable/modules/svm.html
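Support vector machines are sensitive to feature scale, which matters when columns such as age and fare live on very different ranges. The sketch below, assuming a recent scikit-learn, standardizes the features inside a pipeline before fitting an `SVC`. The "age" and "fare" columns and the labeling rule are synthetic stand-ins, not Titanic data or a claim about what drove survival.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for two columns on very different scales.
rng = np.random.RandomState(1)
age = rng.uniform(1, 80, size=200)
fare = rng.uniform(5, 500, size=200)
X = np.column_stack([age, fare])
y = (fare > 250).astype(int)  # arbitrary synthetic rule, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fit on the training split only, then applied to the test split,
# because it lives inside the pipeline.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_model.fit(X_train, y_train)

svm_accuracy = svm_model.score(X_test, y_test)
print(svm_accuracy)
```

Keeping the scaler in the pipeline avoids leaking statistics from the test rows into training.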
This tutorial is based on the following tutorials for Kaggle's Titanic competition:
- https://www.kaggle.com/mlchang/titanic/logistic-model-using-scikit-learn/run/91385
- https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests
- https://github.com/savarin/pyconuk-introtutorial/tree/master/notebooks