Materials for the "Docker for Data Science" tutorial presented at PyCon 2018 in Cleveland, OH.
Jupyter notebooks simplify the process of developing and sharing Data Science projects across groups and organizations. However, when we want to deploy our work into production, we need to extract the model from the notebook and package it up with the required artifacts (data, dependencies, configurations, etc.) to ensure it works in other environments. Containerization technologies such as Docker can be used to streamline this workflow.
This hands-on tutorial presents Docker in the context of Reproducible Data Science - from idea to application deployment. You will get a thorough introduction to the world of containers; learn how to incorporate Docker into various Data Science projects; and walk through the process of building a Machine Learning model in Jupyter and deploying it as a containerized Flask REST API.
This session is geared towards Data Scientists who are interested in learning about Docker and want to understand how to incorporate it in their projects. No prior knowledge of Docker is assumed. Proficiency with Git and the Command Line is not a prerequisite, but will make it easier to follow along.
Upon completion of this tutorial, students will be able to:
- Navigate the Docker ecosystem with ease
- Leverage containers as part of their data science workflow
- Productionize & deploy a Machine Learning model wrapped in an API
Learn how to become a Full-Stack Data Scientist!
- Download Docker for Mac, which contains both Docker and Docker-Compose.
- Install
- Update your package manager.
- Use your package manager to install Docker.
- Use your package manager to install Docker-Compose.
- You might need to add your user account to the `docker` group.
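Whether the group step is needed can be checked before changing anything. The sketch below is one way to do it, assuming a Linux host with the standard `id`, `tr`, and `grep` utilities. The helper name `in_group` is ours for illustration and is not part of Docker.

```shell
#!/usr/bin/env bash
# in_group is a hypothetical helper: it succeeds if the current user
# belongs to the group named in $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if in_group docker; then
    echo "Already in the docker group."
else
    # usermod -aG appends the user to the group without removing other groups.
    echo "Not in the docker group. On most distributions you can run:"
    echo "  sudo usermod -aG docker $(id -un)"
    echo "then log out and back in for the change to take effect."
fi
```

If the group does not exist on your system, your distribution's Docker package may handle permissions differently, so treat the suggested command as a starting point.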
Note: Windows 10 users can use the Linux subsystem to install Docker and Docker-Compose. Instructions are available in a post we found on Medium.
Please also make sure to install Docker-Compose when you are installing Docker. Then proceed to Step 2.
Otherwise, we have created a VM image. USB sticks with the image will be available at the tutorial.
- Download VirtualBox for Windows Hosts.
- Download the VirtualBox image containing all required files and containers. We also have USB sticks containing these images to reduce strain on the conference WiFi.
- Open VirtualBox Manager.
- File > Import Appliance > point to the file you just downloaded. Import it in.
- Double-click the VM to start an instance.
- Login: osboxes | Password: osboxes.org | Root password: osboxes.org
The VM image you download contains the Docker images as well as the repositories, which were cloned to `~/docker-for-data-science`.
- Update the cloned repos by going into each folder and running `git pull`. Skip Steps 2 and 3.
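The update step can also be done in one pass. The following is a minimal sketch, assuming the repositories sit directly under `~/docker-for-data-science` as described above. `update_repos` is a hypothetical helper name, not something shipped with the tutorial.

```shell
#!/usr/bin/env bash
# update_repos is a hypothetical helper: it runs `git pull` in every
# Git repository found directly under the given folder.
update_repos() {
    local base="${1:-$HOME/docker-for-data-science}"
    local repo
    for repo in "$base"/*/; do
        # Skip anything that is not a Git working copy.
        if [ -d "${repo}.git" ]; then
            echo "Updating ${repo}"
            git -C "$repo" pull
        fi
    done
}

update_repos "$HOME/docker-for-data-science"
```

Folders without a `.git` directory are skipped, so the loop does nothing if the folder is empty or missing.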
- Create a folder for this tutorial. We recommend `~/docker-for-data-science`, as this is the folder we use in all of our examples.
- `cd` into the folder.
- Download both repositories:
git clone https://github.com/docker-for-data-science/docker-for-data-science-tutorial.git
git clone https://github.com/docker-for-data-science/talkvoter.git
Please pre-download Docker images to reduce the strain on the conference WiFi.
- `cd ~/docker-for-data-science/docker-for-data-science-tutorial/installation_files`
- Run the shell script:
./download_docker_images.sh
- Build images for the Talk Recommendation application:
cd ~/docker-for-data-science/talkvoter
docker-compose build
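Before the tutorial, it can help to confirm that the tools used in the steps above are on your `PATH`. This is a minimal sketch. `check_tool` is a hypothetical helper name, and the check only confirms that each command exists, not that the Docker daemon is running.

```shell
#!/usr/bin/env bash
# check_tool is a hypothetical helper: it reports whether the command
# named in $1 can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools used in the setup steps of this tutorial.
for tool in git docker docker-compose; do
    check_tool "$tool" || true
done
```

Once all three report `ok`, running `docker images` lists the images that are available locally.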