The idea was to build an end-to-end automated data pipeline for analysing YouTube data and answering common questions about it, such as: which channels are the most viewed across the 3 regions, which videos are the most liked and disliked, and what the average comment count per video is, among others.
Technology stack: Python, Snowflake, Prefect
The workflow is divided into 3 main parts, ETL (Extract, Transform, Load):
- Extract: The dataset is downloaded from the Kaggle website and stored in a local folder (see the extract sketch after this list).
- Transform: Here the dataset is split into 3 continents, Asia, Europe, and North America, which act as the dimension tables. Based on each country's location, its data is appended to the respective continent. The dataset is then cleaned for analysis: null values are handled, unwanted values are dropped, datatypes are made uniform, and so on. After the dimension tables are created, a fact table is built; it acts as a bridge table that joins all the dimension tables via a surrogate key. Along with the surrogate key, new columns are added (illustrated in the transform sketch after this list):
- eu_video_interaction_rate
- na_video_interaction_rate
- as_video_interaction_rate
- Load: The cleaned datasets are then loaded into Snowflake (a virtual data warehouse) using the Snowflake Connector for Python. For connection and authentication I used the Key Pair Authentication & Key Pair Rotation mechanism provided by Snowflake (see the load sketch below). You can read more about it here.
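
A minimal sketch of the extract step, assuming the `kaggle` package is installed and credentials live in `~/.kaggle/kaggle.json`; the dataset slug below is a placeholder, not necessarily the one this project uses:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

def extract(dest: str = "data/raw") -> None:
    """Download the raw YouTube dataset from Kaggle into dest."""
    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json
    # "datasnaek/youtube-new" is an assumed slug; substitute the actual dataset
    api.dataset_download_files("datasnaek/youtube-new", path=dest, unzip=True)

if __name__ == "__main__":
    extract()
```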
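Next, a condensed pandas sketch of the transform step. The country-to-continent mapping, the column names (`likes`, `dislikes`, `comment_count`, `views`), and the interaction-rate formula (interactions per view) are all assumptions for illustration, not the repo's exact logic:

```python
from pathlib import Path
import pandas as pd

# Hypothetical mapping from the 2-letter country prefix of each CSV to a continent
CONTINENT = {"IN": "asia", "JP": "asia", "KR": "asia",
             "GB": "europe", "DE": "europe", "FR": "europe",
             "US": "north_america", "CA": "north_america", "MX": "north_america"}

def transform(raw_dir: str = "data/raw") -> dict[str, pd.DataFrame]:
    parts: dict[str, list[pd.DataFrame]] = {}
    for path in Path(raw_dir).glob("*videos.csv"):
        continent = CONTINENT.get(path.name[:2])  # e.g. "INvideos.csv" -> "IN"
        if continent is None:
            continue
        df = pd.read_csv(path, encoding="latin-1")
        df = df.dropna(subset=["title"])             # handle null values
        df["views"] = df["views"].astype("int64")    # enforce uniform datatypes
        parts.setdefault(continent, []).append(df)

    # One dimension table per continent, each with its own surrogate key
    dims = {name: pd.concat(frames, ignore_index=True)
            for name, frames in parts.items()}
    for name, dim in dims.items():
        dim.insert(0, f"{name}_sk", range(1, len(dim) + 1))

    # Fact table bridging the dimensions via the surrogate keys, plus the
    # assumed interaction-rate columns: (likes + dislikes + comments) / views
    fact_parts = []
    for prefix, name in [("as", "asia"), ("eu", "europe"), ("na", "north_america")]:
        if name not in dims:
            continue
        dim = dims[name]
        fact_parts.append(pd.DataFrame({
            "surrogate_key": dim[f"{name}_sk"],
            "continent": name,
            f"{prefix}_video_interaction_rate":
                (dim["likes"] + dim["dislikes"] + dim["comment_count"]) / dim["views"],
        }))
    dims["fact"] = pd.concat(fact_parts, ignore_index=True)
    return dims
```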
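Finally, a sketch of the load step with the Snowflake Connector for Python and key pair authentication. The account, user, warehouse, database, and key path are placeholders:

```python
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
from cryptography.hazmat.primitives import serialization

def connect() -> snowflake.connector.SnowflakeConnection:
    # Load the RSA private key registered with the Snowflake user (Key Pair
    # Authentication); rotating keys just means registering a new public key
    with open("/path/to/rsa_key.p8", "rb") as f:  # placeholder path
        private_key = serialization.load_pem_private_key(f.read(), password=None)
    key_bytes = private_key.private_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    )
    return snowflake.connector.connect(
        account="MY_ACCOUNT",      # placeholder
        user="MY_USER",            # placeholder
        private_key=key_bytes,
        warehouse="COMPUTE_WH",    # placeholder
        database="YOUTUBE_DB",     # placeholder
        schema="PUBLIC",
    )

def load(tables: dict[str, pd.DataFrame]) -> None:
    conn = connect()
    try:
        for name, df in tables.items():
            # Bulk-load each cleaned DataFrame into its own Snowflake table
            write_pandas(conn, df, table_name=name.upper(), auto_create_table=True)
    finally:
        conn.close()
```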
Once the data reaches Snowflake, a dashboard is created with Snowsight for data analysis and visualisation. The data is analysed using SQL queries; an example is sketched below.
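For instance, a question like "most viewed channels" could be answered with a query along these lines, run in a Snowsight worksheet or through the connector; the table and column names are the assumed ones from the sketches above:

```python
# Illustrative analysis query: top 10 most viewed channels in the ASIA table
with connect() as conn:  # connect() from the load sketch above
    cur = conn.cursor()
    cur.execute("""
        SELECT "channel_title", SUM("views") AS total_views
        FROM ASIA
        GROUP BY "channel_title"
        ORDER BY total_views DESC
        LIMIT 10
    """)
    for channel, total_views in cur.fetchall():
        print(channel, total_views)
```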
This entire process is automated using Prefect and scheduled to run every day; a minimal flow sketch follows.
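
A minimal sketch of that orchestration, assuming Prefect 2.x: the flow wires the three steps together, and `serve` registers a daily cron schedule (the 06:00 time is arbitrary):

```python
from prefect import flow, task

@task(retries=2)
def extract_task():
    ...  # download from Kaggle (see the extract sketch)

@task
def transform_task():
    ...  # split by continent, clean, build dimension + fact tables

@task
def load_task():
    ...  # write the tables to Snowflake

@flow(name="youtube-etl")
def youtube_etl():
    extract_task()
    transform_task()
    load_task()

if __name__ == "__main__":
    # Serve the flow with a daily schedule; Prefect runs it every day at 06:00
    youtube_etl.serve(name="daily-youtube-etl", cron="0 6 * * *")
```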