This project was developed as part of the Data Engineering bootcamp at Artefact School of Data. It retrieves and processes data related to Paris transportation and weather from multiple sources. We implemented a data engineering flow following modern data stack principles:
- Google Cloud Platform (GCP) for infrastructure
- BigQuery for data storage and retrieval
- Apache Airflow for scheduling and orchestrating ETL (Extract, Transform, Load) processes
- dbt (data build tool) Core for data transformation using SQL
## Table of Contents

- Project Overview
- Data Sources
- Architecture
- Prerequisites
- Installation and Setup
- Airflow Configuration
- Running the Project
- Accessing Airflow
## Project Overview

The goal of this project is to build a data pipeline that collects, processes, and stores data from the following sources:

- Metro Line Disruptions: Data on metro line disruptions from the Plateforme Régionale d'Information pour la Mobilité (PRIM).
- Vélib' Stations: Real-time data on stations of Vélib', the Paris public bike rental service.
- Weather Conditions: Weather data from Météo France.
## Data Sources

- PRIM API: Provides information on public transportation disruptions.
- Vélib' Open Data API: Offers real-time availability of bikes and stations (see the fetch sketch below this list).
- Météo France API: Supplies weather forecasts and conditions.
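To make the ingestion concrete, here is a minimal sketch of fetching real-time station status from the Vélib' open data feed with `requests`. The endpoint URL and the `data.stations` field follow the public GBFS convention; treat them as assumptions to verify against the current API documentation, not as code from this repository.

```python
# Minimal sketch: fetch real-time Vélib' station status from the public
# GBFS-style feed. The URL below is an assumption to verify against the
# current Vélib' open data documentation.
import requests

VELIB_STATUS_URL = (
    "https://velib-metropole-opendata.smovengo.cloud"
    "/opendata/Velib_Metropole/station_status.json"
)


def fetch_station_status() -> list[dict]:
    """Return the per-station availability records from the feed."""
    response = requests.get(VELIB_STATUS_URL, timeout=10)
    response.raise_for_status()
    # GBFS feeds nest the records under data -> stations.
    return response.json()["data"]["stations"]


if __name__ == "__main__":
    stations = fetch_station_status()
    print(f"Fetched {len(stations)} station records")
```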
## Architecture

The project uses the following technologies and services:

- Google Cloud Platform (GCP): Hosts the infrastructure.
- Google Compute Engine: Runs the virtual machine for Airflow.
- Google BigQuery: Stores the processed data.
- Apache Airflow: Manages and orchestrates the ETL workflows (a minimal DAG sketch follows this list).
- Pulumi: Infrastructure-as-code tool used for provisioning resources on GCP.
- Python: The primary programming language for the ETL processes.
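As referenced above, here is a minimal sketch (assuming Airflow 2.4+) of what one of the orchestrated DAGs could look like: an hourly extract task feeding a load task. The DAG id, task callables, and their bodies are hypothetical placeholders, not the actual DAGs in this repository.

```python
# Hypothetical DAG sketch (Airflow 2.4+); the task bodies are placeholders,
# not the actual DAGs shipped in this repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_velib() -> None:
    """Call the Vélib' API and stage the raw JSON (placeholder)."""


def load_to_bigquery() -> None:
    """Write the staged records into a BigQuery table (placeholder)."""


with DAG(
    dag_id="velib_hourly",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_velib", python_callable=extract_velib)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract >> load
```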
## Prerequisites

- Python 3.10 or higher
- Google Cloud SDK installed and configured
- Pulumi CLI installed
- Git installed on your local machine
- Make utility installed
- Account on PRIM to obtain an API token
## Installation and Setup

Clone the repository and install the Python dependencies:

```bash
git clone https://github.com/your_username/your_repository.git
cd your_repository
pip install -r requirements.in
```
Go to the Google Cloud Console. Create a new project or use an existing one. Note down the Project ID.
At the root of the project, create a `.env` file (this file is not versioned):

```bash
echo "PROJECT_ID=your-gcp-project-id" > .env
```

Replace `your-gcp-project-id` with your actual GCP project ID.
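For reference, if the ETL code loads this file with `python-dotenv` (an assumption about the repository, not a confirmed detail), the project ID would be picked up like this:

```python
# Sketch assuming the ETL code uses python-dotenv to read .env; adjust to
# the repository's actual configuration loader if it differs.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
PROJECT_ID = os.environ["PROJECT_ID"]  # raises KeyError if the variable is missing
```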
Copy `infra/Pulumi.yaml.example` to `infra/Pulumi.yaml`, then open `Pulumi.yaml` and update the following fields:
```yaml
config:
  gcp:project: your-gcp-project-id
  gcp:region: your-region
  gcp:zone: your-zone
```
Replace `your-gcp-project-id`, `your-region`, and `your-zone` with your GCP project details.

Install the Pulumi dependencies:

```bash
make pulumi-install
```
Use the Makefile to deploy the infrastructure:

```bash
make infra-up
```

This command uses Pulumi to create all necessary resources on GCP.
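For orientation, a Pulumi program for a VM like the Airflow host could look like the following Python sketch. The resource name, machine type, and image are illustrative assumptions; the actual definitions live in the `infra/` directory and may differ.

```python
# Illustrative Pulumi (Python) sketch of a VM like the Airflow host; the
# resource name, machine type, and image are assumptions, not the actual
# definitions in infra/. Region and zone come from the Pulumi.yaml config.
import pulumi
import pulumi_gcp as gcp

airflow_vm = gcp.compute.Instance(
    "airflow-vm",
    machine_type="e2-standard-2",
    boot_disk=gcp.compute.InstanceBootDiskArgs(
        initialize_params=gcp.compute.InstanceBootDiskInitializeParamsArgs(
            image="debian-cloud/debian-12",
        ),
    ),
    network_interfaces=[
        gcp.compute.InstanceNetworkInterfaceArgs(
            network="default",
            # An empty access config requests an ephemeral external IP.
            access_configs=[gcp.compute.InstanceNetworkInterfaceAccessConfigArgs()],
        ),
    ],
)

pulumi.export("instance_name", airflow_vm.name)
```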
## Airflow Configuration

SSH into the VM:

```bash
gcloud compute ssh your-vm-instance-name --project=your-gcp-project-id
```

Replace `your-vm-instance-name` with the name of your VM instance.
Run the Installation Script:

```bash
sudo /opt/install.sh
```

This script installs all software dependencies required on the VM.
Clone the Project Repository on the VM:

```bash
git clone https://github.com/pierrealexandre78/de_project.git
cd de_project
```
Copy Configuration Files:

```bash
cp /opt/gcp_config.json src/config/
cp airflow/.env_example airflow/.env
```
Install the ETL Module:

```bash
pip install -e .
```
## Running the Project

Start Airflow:

```bash
make run_airflow
```

This command starts the Airflow web server and scheduler.
## Accessing Airflow

Airflow Web Interface:

Open your web browser and navigate to `http://your-vm-external-ip:8080`. Log in using the default credentials (if set) or your configured username and password.
Set Up Airflow Variables and Connections:

PRIM API Token: You'll need to create an account on PRIM to obtain an API token. Then:

1. Log in to the Airflow web interface.
2. Navigate to Admin > Variables.
3. Add a new variable:
   - Key: `PRIM_API_KEY`
   - Value: your PRIM API token
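Once the variable is saved, DAG code can read it through Airflow's `Variable` API. In the sketch below, the `apikey` header name and the endpoint parameter are assumptions to check against the PRIM documentation:

```python
# Sketch: read the PRIM token stored as an Airflow Variable inside a task.
# The "apikey" header name is an assumption; check the PRIM API docs.
import requests
from airflow.models import Variable


def fetch_prim_disruptions(url: str) -> dict:
    token = Variable.get("PRIM_API_KEY")  # raises KeyError if the variable is unset
    response = requests.get(url, headers={"apikey": token}, timeout=10)
    response.raise_for_status()
    return response.json()
```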
Important notes:

- PRIM API Account: Don't forget to create an account on PRIM and obtain your API token; it is essential for accessing the metro line disruption data.
- API Keys and Tokens: Keep your API keys and tokens secure, and do not commit them to version control.
- Environment Files: The `.env` files should not be versioned (add them to `.gitignore`) to prevent sensitive data from being exposed.