An end-to-end Formula 1 data pipeline built with Azure Databricks, Azure Data Factory, and Apache Spark.
Ingestion
- For each F1 race, a new folder is created in the raw data layer, containing race-specific files and folders.
- An ADF trigger schedules and executes a notebook pipeline, moving the raw data to the processed layer in Delta Lake (a minimal ingestion sketch follows).
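Below is a minimal ingestion sketch for one of the full-load datasets. The mount-point paths, the `circuits.csv` file name, and the audit columns are assumptions for illustration, not taken from this repo.

```python
# Minimal ingestion sketch (assumed paths, file name, and columns): read one
# race-dated raw file and land it in the processed Delta layer.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.getOrCreate()

p_file_date = "2021-03-28"                                     # hypothetical race-date folder
raw_path = f"/mnt/formula1/raw/{p_file_date}/circuits.csv"     # assumed layout
processed_path = "/mnt/formula1/processed/circuits"            # assumed layout

circuits_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(raw_path)
    .withColumn("ingestion_date", current_timestamp())
    .withColumn("file_date", lit(p_file_date))
)

# Full-load dataset: overwrite the processed Delta table on each run.
circuits_df.write.format("delta").mode("overwrite").save(processed_path)
```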
Transformation
- After the raw data is processed, a transformation pipeline is initiated.
- The transformation notebooks join, filter, and store the processed data in the presentation layer of the Delta Lake.
Analysis
- Data can be queried using Spark SQL or the DataFrame API, enabling flexible data exploration (a query sketch follows this list).
- Visualization tools like Power BI can connect to the Delta Lake to generate dashboards and reports.
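The sketch below shows the same question asked with Spark SQL and with the DataFrame API. The `f1_presentation.race_results` table and its column names are assumed here for illustration.

```python
# Exploration sketch (table and column names are assumptions): top drivers by
# points in a season, via Spark SQL and via the DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, sum as _sum

spark = SparkSession.builder.getOrCreate()

# Spark SQL over the presentation-layer Delta table
top_drivers_sql = spark.sql("""
    SELECT driver_name, SUM(points) AS total_points
    FROM f1_presentation.race_results
    WHERE race_year = 2020
    GROUP BY driver_name
    ORDER BY total_points DESC
    LIMIT 10
""")

# Equivalent query with the DataFrame API
top_drivers_df = (
    spark.table("f1_presentation.race_results")
    .filter(col("race_year") == 2020)
    .groupBy("driver_name")
    .agg(_sum("points").alias("total_points"))
    .orderBy(desc("total_points"))
    .limit(10)
)
top_drivers_df.show()
```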
The raw data is stored in a folder labeled with the date of the race. How each dataset in that folder is processed is described below.
- Drivers, constructors, races, and circuits are processed using a full-load approach, where the data is overwritten in each iteration.
- Results, pit stops, lap times, and qualifying are processed incrementally, with the data being cleaned, formatted, and merged with the existing records.
- Therefore, this solution employs a hybrid data ingestion approach (an incremental-merge sketch follows).
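A minimal sketch of the incremental path, complementing the full-load example above. The `result_id` merge key, the file layout, and the paths are assumptions, not taken from this repo.

```python
# Incremental-load sketch (assumed key, paths, and file format): merge the new
# race's cleaned records into the existing processed Delta table so that
# earlier races are preserved.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
p_file_date = "2021-03-28"  # hypothetical race-date folder

results_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"/mnt/formula1/raw/{p_file_date}/results.csv")   # assumed layout
)

target = DeltaTable.forPath(spark, "/mnt/formula1/processed/results")

# Upsert: update rows that already exist for the key, insert the rest.
(target.alias("tgt")
    .merge(results_df.alias("src"), "tgt.result_id = src.result_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```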
Debugging the pipeline:
- Get Metadata
  - Connects to blob storage using a linked service to verify whether a folder for the current run exists in the raw container.
  - The output includes an `exists` flag.
- If Condition
  - Uses the output from the Get Metadata activity.
  - If the `output.exists` flag is true, the downstream pipeline is executed, since all expected files are available.
  - If the flag is false, a Logic App is triggered to send a failure email to a specified address.
- Databricks Notebook (race_results)
  - Runs the race_results Databricks notebook to create the `race_results` Delta table in the presentation layer.
  - This table is generated by joining and filtering data from four processed-layer tables: races, circuits, drivers, and constructors.
  - The information in this table can be used to display a dashboard of results after each race.
- Databricks Notebook (driver_standings)
  - Runs the driver_standings Databricks notebook to generate the `driver_standings` Delta table in the presentation layer.
  - Since the driver standings are derived from the `race_results` table, this activity runs only after the race results transformation completes successfully (a sketch of both notebooks follows this list).
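A combined sketch of the two transformation notebooks. The database and column names are assumptions, and the processed results table is assumed here as the fact source alongside the four dimension tables named above.

```python
# Transformation sketch (assumed schemas and database names): build race_results
# by joining processed-layer tables, then derive driver_standings from it,
# mirroring the dependency between the two notebooks described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as _sum, when

spark = SparkSession.builder.getOrCreate()

races        = spark.table("f1_processed.races")
circuits     = spark.table("f1_processed.circuits")
drivers      = spark.table("f1_processed.drivers")
constructors = spark.table("f1_processed.constructors")
results      = spark.table("f1_processed.results")   # assumed fact source

race_results = (
    results
    .join(races, "race_id")
    .join(circuits, "circuit_id")
    .join(drivers, "driver_id")
    .join(constructors, "constructor_id")
    .select("race_year", "race_name", "circuit_location",
            "driver_name", "team_name", "position", "points")
)
# The f1_presentation database is assumed to exist already.
(race_results.write.format("delta").mode("overwrite")
    .saveAsTable("f1_presentation.race_results"))

# driver_standings depends on race_results, so it runs only after the step above.
driver_standings = (
    race_results
    .groupBy("race_year", "driver_name")
    .agg(_sum("points").alias("total_points"),
         count(when(col("position") == 1, True)).alias("wins"))
)
(driver_standings.write.format("delta").mode("overwrite")
    .saveAsTable("f1_presentation.driver_standings"))
```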
The pipeline is scheduled by a tumbling window trigger with a 168-hour (one-week) interval, a specified end date, and maximum concurrency set to 1.
Parameter configuration:
{
"p_window_end_date": "@trigger().outputs.windowEndTime"
}
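Below is a sketch of how a notebook might consume this parameter. The widget name matches the configuration above; the ADF Databricks activity is assumed to pass the value through base parameters, and the path and `file_date` column are hypothetical.

```python
# Parameter sketch: read the window end date passed by the tumbling window
# trigger and use it to bound the incremental load.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# dbutils is available in Databricks notebooks without an import; the ADF
# Databricks activity is assumed to supply the value via base parameters.
dbutils.widgets.text("p_window_end_date", "")
v_window_end_date = dbutils.widgets.get("p_window_end_date")

# Hypothetical use: limit the load to files dated within the trigger window.
results_df = (
    spark.read.format("delta").load("/mnt/formula1/processed/results")   # assumed path
    .filter(col("file_date") <= v_window_end_date[:10])
)
```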