Created an optimised pipeline to provide accurate data for analysis, then used Snowsight (provided by Snowflake) to create a dashboard.


youtube_data_analysis

AIM

The idea was to create an end-to-end automated data pipeline that can be used for analysis and to answer common questions about YouTube data, such as which channels are the most viewed across the 3 regions, which videos are the most liked and disliked, the average number of comments on videos, and more.

Technology stack: Python, Snowflake, Prefect

Workflow

The workflow is divided into 3 main parts, ETL (Extract, Transform, Load); a short sketch of each step follows the list:

  1. Extract: The dataset is downloaded from the Kaggle website and stored in a local folder.
  2. Transform: The dataset is split into 3 continents, Asia, Europe and North America, which act as the dimension tables. According to each country's location, its data is appended to the respective continent's table. The dataset is then cleaned for analysis (handling null values, removing unwanted values, enforcing uniform datatypes, and more). After the dimension tables are created, a fact table is built that acts as a bridge table joining all the dimension tables via a surrogate key. Along with the surrogate key, new columns are added:
    • eu_video_interaction_rate
    • na_video_interaction_rate
    • as_video_interaction_rate
    These columns give an overview of the most interacted-with videos in each region, based on their view, like and dislike rates.
  3. Load: The cleaned datasets are then loaded into Snowflake (a cloud data warehouse) using the Snowflake Connector for Python. For connection and authentication I used the Key Pair Authentication & Key Pair Rotation feature provided by Snowflake. You can read more about it here.
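
A minimal sketch of the extract step, assuming the dataset is pulled with the official kaggle package; the dataset slug is a placeholder, since the source does not name it:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Download and unzip the dataset into a local folder.
# The slug below is a placeholder, not the actual dataset used.
api.dataset_download_files("<owner>/<dataset-slug>", path="data/", unzip=True)
```

For the transform step, here is a minimal pandas sketch of how one interaction-rate column could be derived. The column name follows the list above, but the file name and the exact formula are assumptions, not the repository's code:

```python
import pandas as pd

# Load one regional dimension table (file name is an assumption).
eu = pd.read_csv("data/eu_videos.csv")

# Basic cleaning: drop rows with missing counts and enforce numeric types.
eu = eu.dropna(subset=["views", "likes", "dislikes"])
eu[["views", "likes", "dislikes"]] = eu[["views", "likes", "dislikes"]].astype("int64")

# One plausible definition of an interaction rate:
# (likes + dislikes) per view. The real formula may differ.
eu["eu_video_interaction_rate"] = (eu["likes"] + eu["dislikes"]) / eu["views"]
```

And a minimal sketch of the load step with key-pair authentication, following Snowflake's documented pattern for the Python connector; the account, user, table and file names are placeholders:

```python
import snowflake.connector
from cryptography.hazmat.primitives import serialization

# Read the PEM-encoded private key and convert it to the DER bytes
# that the connector expects (per Snowflake's key-pair auth docs).
with open("rsa_key.p8", "rb") as key_file:
    private_key = serialization.load_pem_private_key(key_file.read(), password=None)

pkb = private_key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

conn = snowflake.connector.connect(
    account="<account_identifier>",  # placeholder
    user="<user>",                   # placeholder
    private_key=pkb,
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# Stage a cleaned CSV on the table's stage, then load it (names are placeholders).
cur = conn.cursor()
cur.execute("PUT file://data/eu_videos_clean.csv @%EU_VIDEOS")
cur.execute("COPY INTO EU_VIDEOS FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
cur.close()
conn.close()
```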

Once the data reaches Snowflake, a dashboard is created using Snowsight for data analysis and visualisation. The data is analysed using SQL queries.

This entire process is automated using Prefect and scheduled to run every day.
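
Here is a minimal sketch of how the daily schedule could be wired up, assuming Prefect 2.x; the flow and task names are illustrative, not taken from the repository:

```python
from prefect import flow, task

@task
def extract():
    ...  # download the Kaggle dataset to a local folder

@task
def transform(raw):
    ...  # split by continent, clean, build dimension and fact tables

@task
def load(tables):
    ...  # write the cleaned tables to Snowflake

@flow(name="youtube-etl")  # illustrative name
def youtube_etl():
    raw = extract()
    tables = transform(raw)
    load(tables)

if __name__ == "__main__":
    # Serve the flow with a daily cron schedule (midnight UTC here).
    youtube_etl.serve(name="daily-youtube-etl", cron="0 0 * * *")
```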

Snapshot of Dashboard

(Screenshot of the Snowsight dashboard.)
