Skip to content

Latest commit

 

History

History
67 lines (47 loc) · 2.72 KB

README.md

File metadata and controls

67 lines (47 loc) · 2.72 KB

StarCraft 2 Data Pipeline with Airflow, DuckDB and Streamlit

DAG Logo based on StarCraft II press kit

This project is not only a source for knowledge sharing but also a celebration of my love for gaming and the endless possibilities that data holds. It is a fusion of our two great passions: gaming and Data Engineering. I grew up playing StarCraft: Brood War as well as StarCraft II a lot. I never made it to the grandmaster ladder, but I enjoyed every match experiencing the adrenaline rush of commanding armies, outmanoeuvring opponents, and claiming victory (from time to time at least).

Just as I fine-tuned my build orders and adapt to enemy tactics in StarCraft, I am now optimizing data pipelines, analyze trends, and visualize insights as a Data Engineer. In this project I would like to share knowledge about three useful technologies of today's modern data stacks, which are namely:

  • ⏱️ Apache Airflow: Platform for orchestrating and scheduling complex workflows.
  • 🦆 DuckDB: Lightweight and versatile analytical database.
  • 🚀 Streamlit: User-friendly framework for building interactive web applications.

This project is basically a StarCraft II data pipeline, where data is fetched from the StarCraft II API and stored in DuckDB, orchestrated via Airflow. There is also a Streamlit app to visualize how the current grandmaster ladder in StarCraft II looks like (spoiler: you will not find me in it).

In playground.py you find more examples how to utilize DuckDB and Pandas to analyze the data.

You can find more in my article: https://vojay.de/2024/03/14/starcraft-data-pipeline/

This is the final result:

Airflow DAG

DAG

Streamlit Dashboard

Dashboard

Setup notes

# Create venv
python -m venv .venv
source .venv/bin/activate

# Install Airflow
AIRFLOW_VERSION=2.8.2
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

# Run Airflow to initialize the database
NO_PROXY="*" AIRFLOW_HOME="$(pwd)/airflow" airflow standalone

# Install DuckDB
pip install duckdb

# Install Pandas and PyArrow
pip install pandas
pip install pyarrow

# Install Streamlit
pip install streamlit