Skip to content

cbfacademy/Data-Pipeline-Testing-and-Monitoring

Repository files navigation

🎵 Chinook Music Store - dbt Testing & Monitoring Labs

A hands-on dbt project for learning data pipeline testing and monitoring using the complete Chinook database with real production-like data.

📋 Course Overview

This repository accompanies Module 4: Data Pipeline Testing and Monitoring and contains 8 hands-on labs across 4 decks:

Deck Topic Labs
1 Introduction to Testing Tools Lab 1.1, Lab 1.2
2 Testing Data Pipelines Lab 2.1, Lab 2.2
3 Writing Unit and Integration Tests Lab 3.1, Lab 3.2
4 Monitoring and Maintenance Lab 4.1, Lab 4.2

🎯 What You'll Learn

  • ✅ Connect dbt to real data in BigQuery
  • ✅ Write schema tests (unique, not_null, accepted_values, relationships)
  • ✅ Create singular tests for complex business logic
  • ✅ Build custom generic tests (reusable macros)
  • ✅ Implement unit tests for transformation logic
  • ✅ Set up integration tests across models
  • ✅ Configure monitoring and alerts
  • ✅ Debug pipeline failures systematically

🗄️ The Chinook Database - Real Production Data

The Chinook database represents a digital music store (like iTunes) with real, complete data:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Artist      │────▶│      Album      │────▶│      Track      │
│   275 records   │     │   347 records   │     │  3,503 records  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                       │
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Customer     │────▶│     Invoice     │────▶│   InvoiceLine   │
│    59 records   │     │   412 records   │     │  2,240 records  │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Total: 15,000+ records across 11 tables - This is real-world scale data!

Tables Included:

Table Records Description
Artist 275 Music artists and bands
Album 347 Albums linked to artists
Track 3,503 Individual songs with pricing
Genre 25 Music genres
MediaType 5 File format types
Customer 59 Customer information
Employee 8 Sales representatives
Invoice 412 Purchase transactions
InvoiceLine 2,240 Line items per invoice
Playlist 18 Music playlists
PlaylistTrack 8,715 Playlist-track associations

🚀 Getting Started

Prerequisites

Before starting, ensure you have:

  • Google Cloud account with a GCP project
  • BigQuery API enabled in your project
  • Python 3.8+ installed
  • Google Cloud SDK installed (gcloud)

Step 1: Clone the Repository

git clone https://github.com/your-org/chinook-dbt-testing-labs.git
cd chinook-dbt-testing-labs

Step 2: Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate it
# On Mac/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Step 3: Authenticate with Google Cloud

# Login to Google Cloud
gcloud auth login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Create application default credentials
gcloud auth application-default login

Step 4: Load Chinook Data to BigQuery

This is the key step! Run our data loader to populate your BigQuery project with the complete Chinook database:

python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

This script will:

  1. Download the official Chinook database
  2. Create a chinook_raw dataset in your BigQuery project
  3. Load all 11 tables with complete data
  4. Verify the load was successful

Expected output:

🎵 Chinook Database Loader for BigQuery
==========================================
📥 Downloading Chinook database...
   ✅ Downloaded

📊 Loading tables to BigQuery...
   ✅ Artist: 275 rows loaded
   ✅ Album: 347 rows loaded
   ✅ Track: 3,503 rows loaded
   ...

🎉 SUCCESS! Chinook database loaded to BigQuery

Step 5: Configure dbt Profile

Create or edit ~/.dbt/profiles.yml:

chinook_testing:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: YOUR_PROJECT_ID      # <-- Your GCP project ID
      dataset: chinook_dev          # dbt creates this for transformed models
      location: US
      threads: 4
      timeout_seconds: 300

Step 6: Verify Everything Works

# Test the connection
dbt debug

# Install dbt packages
dbt deps

# Build all models
dbt run

# Run all tests
dbt test

You should see:

Completed successfully
Done. PASS=X WARN=0 ERROR=0

🎉 You're ready to start the labs!


📁 Project Structure

chinook-dbt-testing-labs/
├── README.md                    # This file
├── dbt_project.yml              # dbt project configuration
├── packages.yml                 # dbt package dependencies
├── requirements.txt             # Python dependencies
│
├── scripts/
│   ├── load_chinook_to_bigquery.py  # ⬅️ Run this first!
│   ├── setup.sh                 # Quick setup script
│   └── run_tests_with_retry.sh  # Test runner with retries
│
├── models/
│   ├── staging/                 # Source → Staging transformations
│   │   ├── _stg_chinook.yml     # Source definitions & tests
│   │   ├── stg_artists.sql
│   │   ├── stg_albums.sql
│   │   ├── stg_tracks.sql
│   │   ├── stg_genres.sql
│   │   ├── stg_media_types.sql
│   │   ├── stg_customers.sql
│   │   ├── stg_employees.sql
│   │   ├── stg_invoices.sql
│   │   └── stg_invoice_lines.sql
│   │
│   ├── intermediate/            # Business logic transformations
│   │   ├── _int_chinook.yml
│   │   ├── int_tracks_enriched.sql
│   │   └── int_invoice_totals.sql
│   │
│   ├── marts/                   # Analytics-ready models
│   │   ├── _marts_chinook.yml
│   │   ├── dim_customers.sql
│   │   ├── dim_tracks.sql
│   │   └── fct_sales.sql
│   │
│   └── unit_tests/              # Unit test models
│       └── ...
│
├── tests/                       # Test SQL files
│   ├── assert_*.sql             # Singular tests
│   ├── integration/             # Integration tests
│   ├── monitoring/              # Monitoring tests
│   └── unit_tests/              # Unit test assertions
│
├── macros/tests/                # Custom generic tests
│   ├── test_is_positive.sql
│   ├── test_valid_email.sql
│   └── test_within_range.sql
│
├── seeds/                       # Test fixtures only (not source data)
│   ├── test_tracks_input.csv
│   └── test_tracks_expected.csv
│
├── labs/                        # Lab instructions
│   ├── deck1/
│   │   ├── LAB_1_1_explore_chinook.md
│   │   └── LAB_1_2_first_tests.md
│   ├── deck2/
│   │   ├── LAB_2_1_schema_tests.md
│   │   └── LAB_2_2_singular_tests.md
│   ├── deck3/
│   │   ├── LAB_3_1_unit_tests.md
│   │   └── LAB_3_2_integration_tests.md
│   └── deck4/
│       ├── LAB_4_1_monitoring_setup.md
│       └── LAB_4_2_debugging_failures.md
│
└── analyses/                    # Ad-hoc debug queries
    └── debug_invoice_issues.sql

🧪 Labs Overview

Deck 1: Introduction to Testing Tools

  • Lab 1.1: Explore Chinook & Build Your First Models
  • Lab 1.2: Write Your First dbt Tests

Deck 2: Testing Data Pipelines

  • Lab 2.1: Master Schema Tests (unique, not_null, relationships)
  • Lab 2.2: Create Singular Tests for Business Logic

Deck 3: Writing Unit and Integration Tests

  • Lab 3.1: Build Unit Tests with Test Fixtures
  • Lab 3.2: Implement Integration Tests Across Models

Deck 4: Monitoring and Maintenance

  • Lab 4.1: Set Up Freshness Monitoring & Alerts
  • Lab 4.2: Debug Pipeline Failures Systematically

🏃 Quick Commands

# Load data to BigQuery (run once)
python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

# Build all models
dbt run

# Run all tests
dbt test

# Build and test together
dbt build

# Run specific model
dbt run --select stg_customers

# Test specific model
dbt test --select stg_customers

# Run only schema tests
dbt test --select test_type:schema

# Run only singular tests
dbt test --select test_type:singular

# Generate and view documentation
dbt docs generate && dbt docs serve

🆘 Troubleshooting

"Permission Denied" when loading data

# Make sure you're authenticated
gcloud auth application-default login

"Dataset not found" error

The loader creates chinook_raw dataset. Make sure you ran:

python scripts/load_chinook_to_bigquery.py --project YOUR_PROJECT_ID

"Source not found" error in dbt

Check that your profile points to the correct project where you loaded the data.

Tests failing unexpectedly

# Store failures for investigation
dbt test --store-failures

# Then query the failure table in BigQuery

📚 Additional Resources


📝 License

This project is for educational purposes as part of the CBF Data Engineering curriculum.


Happy Testing! 🎉

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages