This repository contains a data pipeline developed with PySpark in Databricks to analyze the environmental impact of detergents based on sales data. The pipeline processes ESG (Environmental, Social, and Governance) data to provide insights into the environmental footprint of a set of example detergent products.
The project implements a simple ETL pipeline built on three inputs: a CSV file of country data, randomly generated ESG data about 8 detergents and their CO2 emissions, and randomly generated sales data for those detergents in different countries. The CSV file must first be stored as a table in the Databricks data catalog. (Data source: https://www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023)
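The inputs might take roughly the following shape in PySpark. This is only a minimal sketch: the table name `esg_demo.countries_2023`, the `Country` column, and all detergent/sales column names are placeholders, not the project's actual identifiers.

```python
import random

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Countries CSV, previously stored as a table in the Databricks data catalog
countries_df = spark.table("esg_demo.countries_2023")
country_names = [row["Country"] for row in countries_df.select("Country").collect()]

# Randomly generated ESG data: 8 detergents and their CO2 emissions per unit sold
detergents_df = spark.createDataFrame([
    Row(article_id=i,
        article_name=f"Detergent {i}",
        co2_per_unit_kg=round(random.uniform(0.5, 5.0), 2))
    for i in range(1, 9)
])

# Randomly generated sales data of the detergents in different countries
sales_df = spark.createDataFrame([
    Row(article_id=random.randint(1, 8),
        country=random.choice(country_names),
        units_sold=random.randint(1, 1_000))
    for _ in range(1_000)
])
```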
The pipeline generates two results relating to the CO2 footprint caused by the detergents:
- CO2 footprint per article per country
- CO2 footprint per country relative to the country's total CO2 footprint
Further results could be derived, such as the CO2 footprint by month and year, per country, and per article; for demonstration purposes, only the two use cases above are evaluated.
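Both aggregations could be sketched as follows, reusing the placeholder names from the data-setup sketch above. The second result is read here as the detergent footprint per country divided by the country's total CO2 emissions taken from the countries table; the `co2_emissions` column name and its unit are assumptions about that table.

```python
from pyspark.sql import functions as F

# Result 1: CO2 footprint per article per country
footprint_per_article = (
    sales_df.join(detergents_df, "article_id")
    .withColumn("co2_kg", F.col("units_sold") * F.col("co2_per_unit_kg"))
    .groupBy("country", "article_name")
    .agg(F.sum("co2_kg").alias("co2_footprint_kg"))
)

# Result 2: detergent CO2 footprint per country relative to the country's total CO2 footprint
countries_slim = countries_df.select(
    F.col("Country").alias("country"),
    F.col("co2_emissions").alias("country_total_co2"),  # column name/unit assumed; convert units if needed
)
footprint_per_country = (
    footprint_per_article.groupBy("country")
    .agg(F.sum("co2_footprint_kg").alias("detergent_co2_kg"))
    .join(countries_slim, "country")
    .withColumn("share_of_country_total",
                F.col("detergent_co2_kg") / F.col("country_total_co2"))
)
```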
The detergents data and the sales data are related in a star schema: the detergents table acts as a dimension table for the sales fact table. (https://www.databricks.com/blog/2022/05/20/five-simple-steps-for-implementing-a-star-schema-in-databricks-with-delta-lake.html)
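Persisting the two tables in that dimension/fact layout as Delta tables might look like this (schema and table names are again placeholders):

```python
# Dimension table: one row per detergent article
detergents_df.write.format("delta").mode("overwrite").saveAsTable("esg_demo.dim_detergents")

# Fact table: one row per sales record, keyed by article_id and country
sales_df.write.format("delta").mode("overwrite").saveAsTable("esg_demo.fact_sales")
```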
A procedural programming style was chosen because it keeps the code structured, reusable, and logically encapsulated. An object-oriented design would have made less sense here, since the pipeline consists of many individual steps that work together and ultimately form a larger whole.
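In this procedural layout, the pipeline reduces to a handful of named steps composed by one entry point. The skeleton below only illustrates that structure; the function names are invented and the bodies are left out.

```python
from pyspark.sql import DataFrame, SparkSession

def load_countries(spark: SparkSession) -> DataFrame:
    """Read the countries table from the Databricks data catalog."""
    ...

def generate_detergents(spark: SparkSession) -> DataFrame:
    """Randomly generate the ESG data for the 8 detergents (dimension table)."""
    ...

def generate_sales(spark: SparkSession, countries: DataFrame) -> DataFrame:
    """Randomly generate the sales data per detergent and country (fact table)."""
    ...

def compute_footprints(countries: DataFrame, detergents: DataFrame, sales: DataFrame) -> DataFrame:
    """Join the star schema and aggregate the CO2 footprint results."""
    ...

def main() -> None:
    spark = SparkSession.builder.getOrCreate()
    countries = load_countries(spark)
    detergents = generate_detergents(spark)
    sales = generate_sales(spark, countries)
    compute_footprints(countries, detergents, sales).show()
```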
For production usage, the following changes would still be needed:
- Put the basic core code in a Python repository, package it, and only call main() in Databricks. This allows changes to be tracked and versioned, repository access can be restricted (not everyone can simply change production code in the notebook), and the code can be reused in other notebooks. A notebook sketch of this setup follows after this list.
- Keep DB credentials out of the code and read them from a secrets manager or key vault instead (see the same sketch below).
- The data pipeline can generally be triggered externally (e.g. by Data Factory).
- Serverless functions could also be used.
- As a follow-up measure, the DevOps lifecycle could be defined:
  - Development in VS Code, with remote access to a test cluster in Databricks
  - Repositories in Azure DevOps (or GitHub) for versioning the artifacts
  - Production code linked to Databricks Repos
  - Notebooks triggered by Data Factory
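With the core code packaged (e.g. as a hypothetical `esg_pipeline` wheel installed on the cluster) and the credentials moved to a secret scope, the production notebook shrinks to a thin wrapper. The package name, secret scope, key names, and main() signature below are placeholders.

```python
# Databricks notebook cell: thin entry point around the packaged pipeline
from esg_pipeline.main import main  # hypothetical package built from the repo

# Credentials come from a Databricks secret scope (e.g. backed by Azure Key Vault),
# never from the notebook source itself
db_user = dbutils.secrets.get(scope="esg-prod", key="db-user")
db_password = dbutils.secrets.get(scope="esg-prod", key="db-password")

main(db_user=db_user, db_password=db_password)  # signature is illustrative
```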