In this project, I developed a reliable, scalable, and maintainable batch-processing data system, focusing on the data engineering architecture and setup. It uses Airflow to orchestrate a machine learning pipeline that runs automatically once a day, training, evaluating, and fitting models and outputting the best one it finds for detecting fraudulent credit card transactions in that day's batch of data.
For traceability, the data of every run is persisted in a PostgreSQL database.
To assess the performance of each day's model, the evaluation metrics are visualized over time using Dash. To ensure a clean environment and reproducibility, the project is set up using Docker containers.
(It should be noted that the model itself is not fully optimized, as the focus of this project is data engineering. Accuracy is only included for cosmetic reasons; its meaning is limited on such an imbalanced dataset.)
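To illustrate the orchestration described above, here is a minimal sketch of what such a daily DAG can look like; the task names and helper functions (train_model, evaluate_model, persist_results) are hypothetical placeholders rather than the actual contents of /dags/ml_pipeline.py:

```python
# Minimal sketch of a daily ML pipeline DAG (hypothetical task names).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model(**context):
    ...  # train candidate models on today's batch

def evaluate_model(**context):
    ...  # compute evaluation metrics and pick the best model

def persist_results(**context):
    ...  # write metrics and model artifacts to PostgreSQL / the models directory

with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once a day
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    persist = PythonOperator(task_id="persist_results", python_callable=persist_results)

    train >> evaluate >> persist
```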
To reproduce this project, follow these simple steps:
- Have Docker installed on your system
- Download the dataset from Kaggle and place creditcard.csv in the data directory
- Navigate to the folder containing this README in your terminal
- Initialize Airflow with the command "docker compose up airflow-init"
- Once the initialization has finished, run "docker compose up" to start the project
- Wait until all services except the Dash container are healthy - check with the command "docker ps" (Dash cannot be healthy on first startup because it does not yet find any data in the database)
- Enter "localhost:8080" in your preferred browser (like you would enter a website) to open the Airflow GUI
- Enter "airflow" as username and password
- Go to Admin --> Connections --> Add a new record and enter the following:
Connection Id = postgres_default
Connection Type = Postgres
Host = postgres
Schema = airflow
Login = airflow
Password = airflow
Then save the connection. Once this is done, you can go to DAGs, select ml_pipeline, and start the run at the top right. Make sure to have at least 4 GB of RAM available; if a task fails with return code Negsignal.SIGKILL, the cause is insufficient RAM. (A sketch of how the pipeline can use this connection follows after this list.)
- To view the performance of the model, open localhost:80 in your browser, which is the port Dash is mapped to.
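As a rough illustration of what the postgres_default connection is used for, a pipeline step could persist run metrics through Airflow's PostgresHook along these lines; the table and column names here are hypothetical, not the project's actual schema:

```python
# Sketch: writing run metrics to PostgreSQL via the postgres_default connection.
# Table and column names are hypothetical, not the project's actual schema.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def persist_metrics(run_date: str, precision_score: float, recall_score: float) -> None:
    hook = PostgresHook(postgres_conn_id="postgres_default")
    # Create the metrics table on first use.
    hook.run(
        """
        CREATE TABLE IF NOT EXISTS model_metrics (
            run_date DATE,
            precision_score DOUBLE PRECISION,
            recall_score DOUBLE PRECISION
        );
        """
    )
    # Append this run's evaluation results.
    hook.run(
        "INSERT INTO model_metrics (run_date, precision_score, recall_score) VALUES (%s, %s, %s)",
        parameters=(run_date, precision_score, recall_score),
    )
```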
The final models are written to /models (created during the run); /data contains the datasets of the latest run as .csv files. /dags/ml_pipeline.py is the file that defines the directed acyclic graph for Airflow, while /dags/helpfiles contains all pipeline steps used in ml_pipeline.py. /app_dash contains everything related to the Dash visualization.
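The core idea of the Dash app is to read the stored metrics from PostgreSQL and plot them over time. A stripped-down sketch of that idea (assuming the hypothetical model_metrics table from the sketch above and SQLAlchemy/psycopg2 being installed; this is not the actual code in /app_dash) could look like this:

```python
# Sketch of a Dash app plotting evaluation metrics over time.
# Connection URI and table name are hypothetical; requires sqlalchemy and psycopg2.
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

app = Dash(__name__)

def load_metrics() -> pd.DataFrame:
    # Read all stored runs from the PostgreSQL database.
    return pd.read_sql(
        "SELECT run_date, precision_score, recall_score FROM model_metrics ORDER BY run_date",
        "postgresql://airflow:airflow@postgres:5432/airflow",
    )

df = load_metrics()
fig = px.line(
    df,
    x="run_date",
    y=["precision_score", "recall_score"],
    title="Model performance over time",
)

app.layout = html.Div([html.H1("Fraud model metrics"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    # Serve on port 80 to match the port mapping mentioned above.
    app.run(host="0.0.0.0", port=80)
```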
Airflow's webserver offers many ways to track your pipeline's performance and status - check out the Graph or Gantt view!