Data Engineering Project: Parisian Mobility

This project is developed as part of the Data Engineering bootcamp at Artefact School of Data. It involves retrieving and processing data from multiple sources related to Paris transportation and weather. We implemented a data engineering flow using modern data stack principles:

  • Google Cloud Platform (GCP) for infrastructure
  • BigQuery for data storage and retrieval
  • Apache Airflow for scheduling and orchestrating the ETL (Extract, Transform, Load) processes
  • dbt Core (data build tool) for data transformation in SQL

Table of Contents

  • Project Overview
  • Data Sources
  • Architecture
  • Prerequisites
  • Installation and Setup
  • Running the Project
  • Accessing Airflow
  • Airflow Configuration
  • Important Notes

Project Overview

The goal of this project is to build a data pipeline that collects, processes, and stores data from the following sources:

  • Metro Line Disruptions: Data on metro line disruptions from the Plateforme Régionale d'Information pour la Mobilité (PRIM).
  • Vélib' Stations: Real-time data of Vélib' (Paris public bike rental service) stations.
  • Weather Conditions: Weather data from Météo France.

Data Sources

  1. PRIM API: Provides information on public transportation disruptions.
  2. Vélib' Open Data API: Offers real-time availability of bikes and stations.
  3. Météo France API: Supplies weather forecasts and conditions.
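
As a first illustration, here is a minimal sketch of pulling real-time station availability from the Vélib' open data API in Python. The GBFS-style endpoint URL is an assumption and may differ from the one used in the project's DAGs.

# Hedged sketch: fetch real-time Vélib' station status over HTTP.
# The endpoint URL below is an assumption, not necessarily the one used in the DAGs.
import requests

VELIB_STATUS_URL = (
    "https://velib-metropole-opendata.smovengo.cloud/"
    "opendata/Velib_Metropole/station_status.json"  # assumed GBFS endpoint
)

def fetch_velib_station_status() -> list[dict]:
    """Return the raw list of station status records from the Vélib' API."""
    response = requests.get(VELIB_STATUS_URL, timeout=30)
    response.raise_for_status()
    return response.json()["data"]["stations"]

if __name__ == "__main__":
    stations = fetch_velib_station_status()
    print(f"Retrieved {len(stations)} station records")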

Architecture

The project uses the following technologies and services:

  • Google Cloud Platform (GCP): Hosts the infrastructure.
  • Google Compute Engine: Runs the virtual machine for Airflow.
  • Google BigQuery: Stores the processed data.
  • Apache Airflow: Manages and orchestrates the ETL workflows.
  • Pulumi: Infrastructure as code tool used for provisioning resources on GCP.
  • Python: The primary programming language for the ETL processes.
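
To give a feel for how Airflow ties these pieces together, here is a minimal, hypothetical DAG skeleton. The DAG id, task names, and hourly schedule are assumptions rather than the project's actual DAG definitions, and the syntax assumes Airflow 2.4+.

# Hypothetical sketch of an extract-then-load DAG; not the project's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Placeholder: call a source API and stage the raw payload."""

def load_to_bigquery():
    """Placeholder: load the staged payload into a BigQuery table."""

with DAG(
    dag_id="velib_stations_etl",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",               # assumed cadence (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract_task >> load_task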

Prerequisites

  • Python 3.10 or higher
  • Google Cloud SDK installed and configured
  • Pulumi CLI installed
  • Git installed on your local machine
  • Make utility installed
  • Account on PRIM to obtain an API token

Installation and Setup

1. Clone the Repository

git clone https://github.com/your_username/your_repository.git
cd your_repository

2. Install Python Dependencies

pip install -r requirements.in

3. Create a Google Cloud Project

Go to the Google Cloud Console. Create a new project or use an existing one. Note down the Project ID.

4. Configure Environment Variables

At the root of the project, create a .env file (this file is not versioned):

echo "PROJECT_ID=your-gcp-project-id" > .env

Replace your-gcp-project-id with your actual GCP project ID.
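
For reference, the ETL code can pick this value up at runtime. The sketch below assumes the python-dotenv package is available, which may differ from how the project actually loads its configuration.

# Hedged sketch: read PROJECT_ID from the .env file at the project root.
# Assumes python-dotenv; the project may load configuration differently.
import os

from dotenv import load_dotenv

load_dotenv()                          # populate os.environ from .env
PROJECT_ID = os.environ["PROJECT_ID"]  # fails loudly if the variable is missing
print(f"Using GCP project: {PROJECT_ID}")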

5. Update Pulumi Infrastructure Configuration

Copy infra/Pulumi.yaml.example to infra/Pulumi.yaml, then open Pulumi.yaml and update the following fields:

config:
  gcp:project: your-gcp-project-id
  gcp:region: your-region
  gcp:zone: your-zone

Replace your-gcp-project-id, your-region, and your-zone with your GCP project details, then run:

make pulumi-install

6. Deploy Infrastructure to GCP

Use the Makefile to deploy the infrastructure:

make infra-up

This command uses Pulumi to create all necessary resources on GCP.
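
For orientation, the kind of Pulumi (Python) program behind this target might look like the sketch below. Resource names, machine type, and image are illustrative assumptions, not the contents of the infra/ directory.

# Illustrative Pulumi program: a BigQuery dataset and an Airflow VM.
# Names, machine type, and image are assumptions; see infra/ for the real program.
import pulumi
import pulumi_gcp as gcp

# BigQuery dataset for the raw and transformed mobility data
dataset = gcp.bigquery.Dataset(
    "parisian-mobility",
    dataset_id="parisian_mobility",
    location="EU",
)

# Compute Engine VM that hosts Airflow (the zone comes from the gcp:zone config)
vm = gcp.compute.Instance(
    "airflow-vm",
    machine_type="e2-standard-2",
    boot_disk=gcp.compute.InstanceBootDiskArgs(
        initialize_params=gcp.compute.InstanceBootDiskInitializeParamsArgs(
            image="debian-cloud/debian-12",
        ),
    ),
    network_interfaces=[gcp.compute.InstanceNetworkInterfaceArgs(network="default")],
)

pulumi.export("dataset_id", dataset.dataset_id)
pulumi.export("vm_name", vm.name)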

7. Set Up the Compute Engine VM

SSH into the VM:

gcloud compute ssh your-vm-instance-name --project=your-gcp-project-id

Replace your-vm-instance-name with the name of your VM instance.

Run the Installation Script:

sudo /opt/install.sh

This script installs all software dependencies required on the VM.

Clone the Project Repository on the VM:

gh repo clone pierrealexandre78/de_project
# or, if you are not using the GitHub CLI:
git clone https://github.com/pierrealexandre78/de_project.git
cd de_project

Copy Configuration Files:

cp /opt/gcp_config.json src/config/
cp airflow/.env_example airflow/.env

Install the ETL Module:

pip install -e .  
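
With the module installed, the DAG tasks can import its extract and load helpers. As a hedged illustration of the load step, here is a sketch using google-cloud-bigquery; the dataset and table names are assumptions, not the project's actual schema.

# Hedged sketch: append a batch of JSON records to a BigQuery table.
# The default table id is illustrative only.
from google.cloud import bigquery

def load_records(records: list[dict],
                 table_id: str = "your-gcp-project-id.parisian_mobility.velib_status") -> None:
    """Load a list of JSON-like records into the given BigQuery table."""
    client = bigquery.Client()
    job = client.load_table_from_json(
        records,
        table_id,
        job_config=bigquery.LoadJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            autodetect=True,  # infer the schema from the JSON payload
        ),
    )
    job.result()  # block until the load job completes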

Running the Project

Start Airflow:

make run_airflow

This command starts the Airflow web server and scheduler.

Accessing Airflow

Airflow Web Interface:

Open your web browser and navigate to http://your-vm-external-ip:8080. Log in using the default credentials (if set) or your configured username and password.

Airflow Configuration

Set Up Airflow Variables and Connections:

PRIM API Token: You'll need to create an account on PRIM to obtain an API token.

Log in to the Airflow web interface.

Navigate to Admin > Variables.

Add a new variable:

  Key: PRIM_API_KEY
  Value: your PRIM API token
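
A task can then read this variable at runtime. The sketch below is an assumption of how the PRIM disruptions endpoint might be called; check the PRIM catalogue for the exact URL and authentication header.

# Hedged sketch: read the PRIM_API_KEY Airflow variable and call the PRIM API.
# The endpoint URL and header name are assumptions, not confirmed by the project.
import requests
from airflow.models import Variable

PRIM_DISRUPTIONS_URL = (
    "https://prim.iledefrance-mobilites.fr/marketplace/"
    "disruptions_bulk/disruptions/v2"  # assumed endpoint
)

def fetch_metro_disruptions() -> dict:
    """Return the raw disruptions payload for Île-de-France public transport."""
    api_key = Variable.get("PRIM_API_KEY")  # set in Admin > Variables
    response = requests.get(
        PRIM_DISRUPTIONS_URL,
        headers={"apikey": api_key},  # assumed header name
        timeout=30,
    )
    response.raise_for_status()
    return response.json()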

Important Notes

  • PRIM API account: Don't forget to create an account on PRIM and obtain your API token; it is required to access the metro line disruption data.
  • API keys and tokens: Keep your API keys and tokens secure and do not commit them to version control.
  • Environment files: The .env files should not be versioned (add them to .gitignore) to prevent sensitive data from being exposed.
