An end-to-end Formula 1 data pipeline built with Azure Databricks, Azure Data Factory, and Apache Spark.
Ingestion
- For each F1 race, a new folder is created in the raw data layer, containing race-specific files and folders.
- An ADF trigger schedules and executes a notebook pipeline, moving the raw data to the processed layer in Delta Lake (a minimal ingestion sketch follows).
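Below is a minimal ingestion sketch for one of the full-load datasets. The mount-point paths, the `circuits.csv` file name, and the audit columns are assumptions for illustration, not taken from this repo.

```python
# Minimal ingestion sketch (assumed paths, file name, and columns): read one
# race-dated raw file and land it in the processed Delta layer.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.getOrCreate()

p_file_date = "2021-03-28"                                     # hypothetical race-date folder
raw_path = f"/mnt/formula1/raw/{p_file_date}/circuits.csv"     # assumed layout
processed_path = "/mnt/formula1/processed/circuits"            # assumed layout

circuits_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(raw_path)
    .withColumn("ingestion_date", current_timestamp())
    .withColumn("file_date", lit(p_file_date))
)

# Full-load dataset: overwrite the processed Delta table on each run.
circuits_df.write.format("delta").mode("overwrite").save(processed_path)
```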
Transformation
- After the raw data is processed, a transformation pipeline is initiated.
- The transformation notebooks join, filter, and store the processed data in the presentation layer of the Delta Lake.
Analysis
- Data can be queried using Spark SQL or the DataFrame API, enabling flexible data exploration (a query sketch follows this list).
- Visualization tools like Power BI can connect to the Delta Lake to generate dashboards and reports.
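The sketch below shows the same question asked with Spark SQL and with the DataFrame API. The `f1_presentation.race_results` table and its column names are assumed here for illustration.

```python
# Exploration sketch (table and column names are assumptions): top drivers by
# points in a season, via Spark SQL and via the DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, sum as _sum

spark = SparkSession.builder.getOrCreate()

# Spark SQL over the presentation-layer Delta table
top_drivers_sql = spark.sql("""
    SELECT driver_name, SUM(points) AS total_points
    FROM f1_presentation.race_results
    WHERE race_year = 2020
    GROUP BY driver_name
    ORDER BY total_points DESC
    LIMIT 10
""")

# Equivalent query with the DataFrame API
top_drivers_df = (
    spark.table("f1_presentation.race_results")
    .filter(col("race_year") == 2020)
    .groupBy("driver_name")
    .agg(_sum("points").alias("total_points"))
    .orderBy(desc("total_points"))
    .limit(10)
)
top_drivers_df.show()
```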
The raw data is stored in a folder labeled with the date of the race. How each dataset in that folder is processed is described below.
- Drivers, constructors, races, and circuits are processed using a full-load approach, where the data is overwritten in each iteration.
- Results, pit stops, lap times, and qualifying are processed incrementally, with the data being cleaned, formatted, and merged with the existing records.
- Therefore, this solution employs a hybrid data ingestion approach (an incremental-merge sketch follows).
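A minimal sketch of the incremental path, complementing the full-load example above. The `result_id` merge key, the file layout, and the paths are assumptions, not taken from this repo.

```python
# Incremental-load sketch (assumed key, paths, and file format): merge the new
# race's cleaned records into the existing processed Delta table so that
# earlier races are preserved.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
p_file_date = "2021-03-28"  # hypothetical race-date folder

results_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"/mnt/formula1/raw/{p_file_date}/results.csv")   # assumed layout
)

target = DeltaTable.forPath(spark, "/mnt/formula1/processed/results")

# Upsert: update rows that already exist for the key, insert the rest.
(target.alias("tgt")
    .merge(results_df.alias("src"), "tgt.result_id = src.result_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```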
Debugging the pipeline:
- Get Metadata
  - Connects to blob storage using a linked service to verify whether a folder for the current run exists in the raw container.
  - The output includes an `exists` flag.
- If Condition
  - Uses the output from the Get Metadata activity.
  - If the `output.exists` flag is true, the downstream pipeline is executed, since all expected files are available.
  - If the flag is false, a Logic App is triggered to send a failure email to a specified address.
- Databricks Notebook (race_results)
  - Runs the race_results Databricks notebook to create the `race_results` Delta table in the presentation layer.
  - This table is generated by joining and filtering data from four processed-layer tables: races, circuits, drivers, and constructors.
  - The information in this table can be used to display a dashboard of results after each race.
- Databricks Notebook (driver_standings)
  - Runs the driver_standings Databricks notebook to generate the `driver_standings` Delta table in the presentation layer.
  - Since the driver standings are derived from the `race_results` table, this activity runs only after the race results transformation completes successfully (a sketch of both notebooks follows this list).
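A combined sketch of the two transformation notebooks. The database and column names are assumptions, and the processed results table is assumed here as the fact source alongside the four dimension tables named above.

```python
# Transformation sketch (assumed schemas and database names): build race_results
# by joining processed-layer tables, then derive driver_standings from it,
# mirroring the dependency between the two notebooks described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as _sum, when

spark = SparkSession.builder.getOrCreate()

races        = spark.table("f1_processed.races")
circuits     = spark.table("f1_processed.circuits")
drivers      = spark.table("f1_processed.drivers")
constructors = spark.table("f1_processed.constructors")
results      = spark.table("f1_processed.results")   # assumed fact source

race_results = (
    results
    .join(races, "race_id")
    .join(circuits, "circuit_id")
    .join(drivers, "driver_id")
    .join(constructors, "constructor_id")
    .select("race_year", "race_name", "circuit_location",
            "driver_name", "team_name", "position", "points")
)
# The f1_presentation database is assumed to exist already.
(race_results.write.format("delta").mode("overwrite")
    .saveAsTable("f1_presentation.race_results"))

# driver_standings depends on race_results, so it runs only after the step above.
driver_standings = (
    race_results
    .groupBy("race_year", "driver_name")
    .agg(_sum("points").alias("total_points"),
         count(when(col("position") == 1, True)).alias("wins"))
)
(driver_standings.write.format("delta").mode("overwrite")
    .saveAsTable("f1_presentation.driver_standings"))
```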
The pipeline is scheduled by a tumbling window trigger with a 168-hour (one-week) interval, a specified end date, and maximum concurrency set to 1.
Parameter configuration:
{
"p_window_end_date": "@trigger().outputs.windowEndTime"
}
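Below is a sketch of how a notebook might consume this parameter. The widget name matches the configuration above; the ADF Databricks activity is assumed to pass the value through base parameters, and the path and `file_date` column are hypothetical.

```python
# Parameter sketch: read the window end date passed by the tumbling window
# trigger and use it to bound the incremental load.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# dbutils is available in Databricks notebooks without an import; the ADF
# Databricks activity is assumed to supply the value via base parameters.
dbutils.widgets.text("p_window_end_date", "")
v_window_end_date = dbutils.widgets.get("p_window_end_date")

# Hypothetical use: limit the load to files dated within the trigger window.
results_df = (
    spark.read.format("delta").load("/mnt/formula1/processed/results")   # assumed path
    .filter(col("file_date") <= v_window_end_date[:10])
)
```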