A comprehensive, production-ready repository demonstrating Databricks Asset Bundles (DAB) with complete infrastructure-as-code deployment on Azure using Terraform and GitHub Actions.
- Overview
- What are Databricks Asset Bundles?
- Why Use DAB?
- Repository Structure
- Prerequisites
- Quick Start
- Deployment Guide
- DAB Examples
- Old Way vs DAB Way
- Architecture
- Troubleshooting
- Contributing
This repository provides a complete, end-to-end implementation of Databricks Asset Bundles (DAB) including:
- ✅ Complete Azure Infrastructure - Terraform code to deploy Databricks workspace on Azure
- ✅ Two Production-Ready DAB Examples - ETL Pipeline and ML Training workflows
- ✅ Automated CI/CD - GitHub Actions for both infrastructure and DAB deployments
- ✅ Multi-Environment Support - Dev and Prod configurations with environment parity
- ✅ Security Best Practices - Azure Service Principal authentication, secrets management
- ✅ Comprehensive Documentation - Setup guides, architecture diagrams, and tutorials
Databricks Asset Bundles (DAB) is a deployment framework that enables Infrastructure-as-Code (IaC) for Databricks jobs, workflows, Delta Live Tables, and other workspace resources.
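Under the hood, a bundle is just declarative YAML checked into the repository. A minimal sketch of the entry-point file (simplified and illustrative; see the databricks.yml in this repository for the real configuration):

# Minimal illustrative sketch of a bundle entry point (simplified;
# see this repository's databricks.yml for the actual configuration)
bundle:
  name: databricks-dab-lab

include:
  - resources/*.yml          # pull in job definitions from resources/

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production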
| Feature | Description |
|---|---|
| Version Control | All job configurations, notebooks, and code in Git |
| Environment Management | Deploy to dev, staging, prod with guaranteed parity |
| CI/CD Integration | Native GitHub Actions, GitLab CI, Azure DevOps support |
| Validation | Built-in validation before deployment |
| State Management | Automatic tracking of deployed resources |
| Rollback | Easy rollback via Git revert |
1. Create jobs manually in Databricks UI
2. Copy-paste configurations between environments
3. No version control of job configurations
4. Manual parameter updates across multiple jobs
5. Configuration drift between dev and prod
6. No automated testing or validation
7. Difficult team collaboration
8. No rollback capability
# One configuration file
resources:
  jobs:
    etl_pipeline:
      name: "ETL Pipeline - ${bundle.target}"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/etl_pipeline/extract.py
# Deploy anywhere with one command
$ databricks bundle deploy -t dev # Deploy to dev
$ databricks bundle deploy -t prod  # Deploy to prod
Benefits:
- ✅ Version controlled in Git
- ✅ Environment parity guaranteed
- ✅ Code review process for job changes
- ✅ Automated testing and validation
- ✅ Easy rollback (git revert + redeploy)
- ✅ Team collaboration built-in
See Old Way vs DAB Way for detailed comparison.
databricks-dab-lab/
├── .github/
│ └── workflows/
│ ├── terraform-deploy.yml # Infrastructure deployment pipeline
│ └── dab-deploy.yml # DAB deployment pipeline
├── terraform/ # Azure infrastructure as code
│ ├── main.tf # Provider configuration
│ ├── variables.tf # Input variables
│ ├── resources.tf # Databricks workspace & resources
│ ├── data.tf # Data sources
│ ├── outputs.tf # Output values
│ ├── locals.tf # Local values & naming conventions
│ └── terraform.tfvars.example # Example variables file
├── src/ # Source code for DAB jobs
│ ├── setup/ # Setup scripts
│ │ └── create_sample_data.py # Sample data generation
│ ├── etl_pipeline/ # ETL job notebooks
│ │ ├── extract.py # Data extraction
│ │ ├── transform.py # Data transformation
│ │ ├── load.py # Data loading
│ │ └── validate.py # Data quality validation
│ └── ml_training/ # ML training notebooks
│ ├── prepare_data.py # Feature engineering
│ ├── train_model.py # Model training
│ ├── evaluate_model.py # Model evaluation
│ └── register_model.py # Model registration
├── resources/ # DAB job configurations
│ ├── setup_job.yml # Setup job definition
│ ├── etl_pipeline_job.yml # ETL job definition
│ └── ml_training_job.yml # ML training job definition
├── notebooks/
│ └── old_approach/ # Documentation of old methods
│ └── manual_job_setup.md # Old way vs DAB comparison
├── scripts/ # Utility scripts
│ ├── setup-github-secrets.sh # Interactive secrets setup
│ └── gh-secrets-commands.md # GitHub CLI commands reference
├── databricks.yml # Main DAB configuration file
├── README.md # This file
└── .gitignore # Git ignore rules
- Azure CLI (v2.40+)
- Terraform (v1.14)
- Databricks CLI (v0.213.0+)
- Git
- GitHub CLI (optional, for secrets setup)
- GitHub Account with Actions enabled
You need an existing Azure infrastructure with:
- Resource Group: Already created (e.g., `rg-databricks-dab`)
- Storage Account: For Terraform state (e.g., `yourbackendstorage`)
- Container: In the storage account (e.g., `tfdab`)
- Service Principal with:
  - `Contributor` role on the Resource Group
  - `Storage Blob Data Contributor` on the Storage Account
Note: This project uses existing infrastructure. The Service Principal has limited permissions (Resource Group level only, not subscription-wide) following security best practices.
- Basic understanding of Git and GitHub
- Familiarity with Azure portal
- Basic knowledge of Databricks concepts
- Understanding of YAML syntax
git clone https://github.com/yourghusername/databricks-dab-lab.git
cd databricks-dab-lab
You need to configure the following secrets in your GitHub repository:
Required Secrets for Terraform Deployment:
- `AZURE_SUBSCRIPTION_ID` - Your Azure subscription ID
- `AZURE_CLIENT_ID` - Service Principal application ID
- `AZURE_CLIENT_SECRET` - Service Principal password
- `AZURE_TENANT_ID` - Your Azure AD tenant ID
- `TF_STATE_RESOURCE_GROUP` - Resource group for Terraform state
- `TF_STATE_STORAGE_ACCOUNT` - Storage account for Terraform state
- `TF_STATE_CONTAINER_NAME` - Container for Terraform state files
Required Secrets for DAB Deployment:
- `DATABRICKS_HOST` - Databricks workspace URL (set after Terraform deployment)
- `DATABRICKS_TOKEN` - Databricks personal access token (set after Terraform deployment)
- `DATABRICKS_CLUSTER_ID` - Cluster ID (set after Terraform deployment)
Option A: Interactive Script (Easiest)
./scripts/setup-github-secrets.sh
Option B: GitHub CLI Manual Commands
# See scripts/gh-secrets-commands.md for individual commands
gh secret set AZURE_SUBSCRIPTION_ID --body="<your-subscription-id>"
gh secret set AZURE_CLIENT_ID --body="<your-client-id>"
# ... etc
Option C: GitHub Web UI
- Go to your repository on GitHub
- Settings → Secrets and variables → Actions
- Click "New repository secret" for each secret
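Once the secrets are in place, the GitHub Actions workflows read them as environment variables. A rough sketch of how dab-deploy.yml might consume them (step layout is illustrative and may differ from the actual workflow in .github/workflows/):

# Illustrative excerpt only - the real pipeline lives in .github/workflows/dab-deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Validate and deploy the bundle
        run: |
          databricks bundle validate -t dev
          databricks bundle deploy -t dev --var="cluster_id=${{ secrets.DATABRICKS_CLUSTER_ID }}"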
- Go to your repository's Actions tab
- Select Terraform Azure Databricks Deployment workflow
- Click Run workflow
- Select:
  - Action: `apply`
  - Auto-approve: `false` (recommended for first run)
- Click Run workflow
The workflow will:
- Initialize Terraform with remote state
- Validate configuration
- Create execution plan
- Deploy Databricks workspace and cluster
- Create directories and secret scopes
- Output workspace URL and cluster ID
After Terraform deployment completes, you need to set three additional secrets:
4.1 Get Databricks Host URL
Check the Terraform workflow output or run:
cd terraform
terraform output databricks_host
# Example: adb-1234567890123456.7.azuredatabricks.net
4.2 Get Cluster ID
From Terraform output:
terraform output databricks_cluster_id
# Example: 1229-221552-7wmjd6ef
4.3 Generate Databricks Token
- Open the Databricks workspace URL from step 4.1
- Click your username (top right) → User Settings
- Go to Access Tokens tab
- Click Generate New Token
- Enter a comment (e.g., "GitHub Actions DAB") and lifetime (e.g., 90 days)
- Click Generate
- Copy the token immediately (it won't be shown again)
4.4 Set the Secrets
gh secret set DATABRICKS_HOST --body="<workspace-url>"
gh secret set DATABRICKS_CLUSTER_ID --body="<cluster-id>"
gh secret set DATABRICKS_TOKEN --body="<token-value>"
- Go to Actions tab
- Select DAB Deployment workflow
- Click Run workflow
- Select:
  - Action: `deploy`
  - Environment: `dev`
- Click Run workflow
This deploys three jobs to your Databricks workspace:
- Setup Job: Creates sample data
- ETL Pipeline: Data processing workflow
- ML Training Pipeline: Machine learning workflow
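To double-check what landed in the workspace, you can inspect it from the CLI (command availability varies slightly by Databricks CLI version):

# Summarize what the bundle deployed to the dev target
databricks bundle summary -t dev

# Or list workspace jobs and confirm the three jobs above are present
databricks jobs list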
Before running the main jobs, create sample data:
Option A: Via Databricks CLI
databricks bundle run setup_sample_data -t dev
Option B: Via Databricks UI
- Open your Databricks workspace
- Go to Workflows in the left sidebar
- Find "DAB Setup - Create Sample Data - dev"
- Click Run now
This creates:
- Schema: `hive_metastore.dab_lab`
- Table: `raw_customer_data` (1000 sample records)
Run ETL Pipeline:
databricks bundle run etl_pipeline -t dev
Or via Databricks UI: Workflows → "DAB ETL Pipeline - dev" → Run now
The pipeline will:
- Extract data from `raw_customer_data`
- Transform and clean the data
- Load to `transformed_data` and `final_data` tables
- Validate data quality
Run ML Training Pipeline:
databricks bundle run ml_training -t dev
Or via Databricks UI: Workflows → "DAB ML Training Pipeline - dev" → Run now
The pipeline will:
- Prepare features from `final_data`
- Train a classification model
- Evaluate model performance
- Register model to MLflow Model Registry
A complete ETL workflow demonstrating:
- Extract: Read from Delta tables
- Transform: Data cleaning, enrichment, and quality checks
- Load: Write to Delta tables with schema evolution
- Validate: Data quality checks and metrics
Configuration: resources/etl_pipeline_job.yml
Task Flow:
extract → transform → load → validate
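That ordering is expressed with task dependencies in the job definition. A trimmed sketch of the pattern (refer to resources/etl_pipeline_job.yml for the complete definition):

# Trimmed sketch of the task wiring; see resources/etl_pipeline_job.yml for the full job
tasks:
  - task_key: extract
    notebook_task:
      notebook_path: ../src/etl_pipeline/extract.py
  - task_key: transform
    depends_on:
      - task_key: extract
    notebook_task:
      notebook_path: ../src/etl_pipeline/transform.py
  - task_key: load
    depends_on:
      - task_key: transform
    notebook_task:
      notebook_path: ../src/etl_pipeline/load.py
  - task_key: validate
    depends_on:
      - task_key: load
    notebook_task:
      notebook_path: ../src/etl_pipeline/validate.py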
Key Features:
- Parameterized inputs/outputs
- Data quality validation
- Error handling and logging
- Schema evolution support
A complete MLOps workflow demonstrating:
- Prepare: Feature engineering and train/test split
- Train: Model training with hyperparameter tuning
- Evaluate: Model performance evaluation
- Register: MLflow Model Registry integration
Configuration: resources/ml_training_job.yml
Task Flow:
prepare_training_data → train_model → evaluate_model → register_model
Key Features:
- MLflow experiment tracking
- Hyperparameter tuning
- Model evaluation metrics
- Automated model registration
- Environment-based deployment (Staging/Production)
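As an illustration of how the registration step can be parameterized per environment, a hypothetical fragment (parameter names are invented for this sketch; the real values live in resources/ml_training_job.yml):

# Hypothetical fragment - parameter names are illustrative only;
# see resources/ml_training_job.yml for the actual configuration
- task_key: register_model
  depends_on:
    - task_key: evaluate_model
  notebook_task:
    notebook_path: ../src/ml_training/register_model.py
    base_parameters:
      experiment_path: /Shared/dab-lab/experiments   # created by Terraform
      model_stage: Staging                           # e.g. Production for the prod target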
# 1. Create notebook in Databricks UI
# 2. Manually configure job via UI:
# - Job name
# - Cluster settings
# - Schedule
# - Parameters
# - Notifications
# 3. Test in dev environment
# 4. Repeat ALL steps manually in prod
# 5. No version control of job configuration
# 6. Hope you didn't miss any settings
Problems:
- ❌ Configuration drift between environments
- ❌ No version control for job definitions
- ❌ Manual errors during replication
- ❌ Difficult to review changes
- ❌ No rollback capability
- ❌ Time-consuming for multiple jobs
# resources/etl_pipeline_job.yml
resources:
  jobs:
    etl_pipeline:
      name: "ETL Pipeline - ${bundle.target}"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/etl_pipeline/extract.py
            base_parameters:
              source_table: ${var.catalog}.${var.schema}.raw_data
          existing_cluster_id: ${var.cluster_id}
# Deploy to any environment
databricks bundle deploy -t dev
databricks bundle deploy -t prod
# Run the job
databricks bundle run etl_pipeline -t dev
Benefits:
- ✅ Single source of truth in Git
- ✅ Guaranteed environment parity
- ✅ Code review process
- ✅ Automated validation
- ✅ One command deployment
- ✅ Easy rollback (git revert)
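Rollback, in particular, is just another deployment. A sketch, assuming you know which commit to undo:

# Undo the change in Git...
git revert <bad-commit-sha>
git push origin main

# ...then redeploy the same target (or re-run the dab-deploy workflow)
databricks bundle deploy -t prod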
See notebooks/old_approach/manual_job_setup.md for detailed comparison.
┌─────────────────────────────────────────────────────────────┐
│ GitHub │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Terraform Code │ │ DAB Config │ │
│ │ (terraform/) │ │ (databricks.yml) │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ ┌────────▼──────────────────────────────▼─────────┐ │
│ │ GitHub Actions Workflows │ │
│ │ ├─ terraform-deploy.yml │ │
│ │ └─ dab-deploy.yml │ │
│ └───────────────────┬──────────────────────────────┘ │
└────────────────────────┼────────────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │
▼ ▼
┌───────────────────┐ ┌──────────────────────┐
│ Terraform │ │ Databricks │
│ Remote State │ │ Workspace │
│ (Azure Storage) │ │ │
└───────────────────┘ │ ┌────────────────┐ │
│ │ Cluster │ │
│ ├────────────────┤ │
│ │ Jobs │ │
│ │ • Setup │ │
│ │ • ETL Pipeline │ │
│ │ • ML Training │ │
│ ├────────────────┤ │
│ │ Delta Tables │ │
│ ├────────────────┤ │
│ │ MLflow │ │
│ │ Experiments │ │
│ └────────────────┘ │
└──────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Step 1: Code Changes │
│ ─────────────────────────────────────────────────────── │
│ Developer commits changes to: │
│ • Job configurations (resources/*.yml) │
│ • Notebook code (src/**/*.py) │
│ • DAB config (databricks.yml) │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 2: GitHub Actions Trigger │
│ ─────────────────────────────────────────────────────── │
│ Workflow: dab-deploy.yml │
│ • Checkout code │
│ • Setup Databricks CLI │
│ • Authenticate (DATABRICKS_HOST + TOKEN) │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 3: DAB Validation │
│ ─────────────────────────────────────────────────────── │
│ databricks bundle validate -t <env> │
│ • Check YAML syntax │
│ • Validate variable references │
│ • Verify notebook paths │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 4: DAB Deployment │
│ ─────────────────────────────────────────────────────── │
│ databricks bundle deploy -t <env> │
│ • Upload notebooks to workspace │
│ • Create/update job definitions │
│ • Update job parameters │
│ • Track deployment state │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 5: Jobs Ready to Run │
│ ─────────────────────────────────────────────────────── │
│ Jobs are deployed and ready in Databricks workspace: │
│ • Manual trigger via UI │
│ • Scheduled execution │
│ • API/CLI trigger: databricks bundle run <job> -t <env> │
└─────────────────────────────────────────────────────────┘
# databricks.yml
targets:
  dev:
    mode: development
    # Uses hive_metastore.dab_lab
    # Models registered to Staging
  prod:
    mode: production
    # Can override catalog/schema
    # Models registered to Production
Environment Parity: Same code, different configurations
- Variable substitution: `${bundle.target}`
- Environment-specific parameters
- Different resource naming
- Separate MLflow experiments
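Concretely, the per-target differences are driven by bundle variables. A simplified sketch (variable names follow those used elsewhere in this README; the prod override shown is hypothetical):

# Simplified sketch - see databricks.yml for the authoritative configuration
variables:
  catalog:
    default: hive_metastore
  schema:
    default: dab_lab
  cluster_id:
    description: Existing cluster the jobs run on

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
    variables:
      schema: dab_lab_prod   # hypothetical prod-only override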
Error: azure-client-id is required
Solution:
- Verify all secrets are set in GitHub Actions
- Check secret names match exactly (case-sensitive)
- Verify Service Principal credentials are valid
Error: cluster '<cluster-id>' not found
Error: RESOURCE_DOES_NOT_EXIST: Workspace not found
Root Cause: After running terraform destroy and terraform apply, the following values change:
- Databricks workspace URL (`DATABRICKS_HOST`)
- Cluster ID (`DATABRICKS_CLUSTER_ID`)
- Access tokens (`DATABRICKS_TOKEN`)
Solution - Update All Affected Secrets:
1. Get new Databricks Host:
   cd terraform
   terraform output databricks_host
   # Example output: adb-1234567890123456.7.azuredatabricks.net
2. Get new Cluster ID:
   terraform output databricks_cluster_id
   # Example output: 1229-221552-7wmjd6ef
3. Generate new Databricks Token:
   - Login to the NEW Databricks workspace URL
   - User Settings → Access Tokens → Generate New Token
   - Copy the token value
4. Update GitHub Secrets (all three must be updated):
   # Update Databricks Host
   gh secret set DATABRICKS_HOST --body="<new-workspace-url>"
   # Update Cluster ID
   gh secret set DATABRICKS_CLUSTER_ID --body="<new-cluster-id>"
   # Update Access Token
   gh secret set DATABRICKS_TOKEN --body="<new-token>"
5. Redeploy DAB:
   databricks bundle deploy -t dev --var="cluster_id=<new-cluster-id>"
Important: This is required EVERY time you run terraform destroy followed by terraform apply, as new Databricks resources are created with different IDs.
Error: failed to load databricks.yml
Solution:
- Check YAML syntax (indentation, quotes)
- Validate variable references: `${var.variable_name}`
- Ensure notebook paths are correct (relative to bundle root)
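Running the validation locally usually pinpoints the offending line before you push:

databricks bundle validate -t dev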
Error: Cluster <id> does not exist
Solution:
- Verify cluster is running in Databricks UI
- Check `cluster_id` variable matches the deployed cluster
- Ensure cluster has not auto-terminated
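You can also check and restart the cluster from the CLI (exact syntax differs slightly between Databricks CLI versions):

# Inspect the cluster state (look for RUNNING vs TERMINATED)
databricks clusters get <cluster-id>

# Start it again if it auto-terminated
databricks clusters start <cluster-id>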
Error: RESOURCE_DOES_NOT_EXIST: Workspace directory '/Shared/dab-lab/experiments' not found
Solution: The MLflow experiments directory is created by Terraform. If you see this error:
- Verify Terraform deployment completed successfully
- Check the directory exists in Databricks: Workspace → Shared → dab-lab → experiments
- If missing, re-run Terraform apply
Important for Cleanup: Before running terraform destroy, manually delete the MLflow experiments directory and its contents from the Databricks UI to avoid "directory not empty" errors.
Error: Table or view not found: hive_metastore.dab_lab.raw_customer_data
Solution: Run the setup job first:
databricks bundle run setup_sample_data -t dev
Enable debug logging:
# Terraform
export TF_LOG=DEBUG
terraform apply
# Databricks CLI
databricks bundle deploy -t dev --debug
# Azure CLI
az login --debug
For issues not covered here:
- Check Databricks Asset Bundles Documentation
- Review GitHub Actions workflow logs
- Check Databricks job run logs in the workspace UI
- Open an issue in this repository
Contributions are welcome! This repository is designed as a learning resource and demonstration of DAB best practices.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Test your changes thoroughly
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Additional DAB job examples
- Enhanced error handling
- Additional data quality checks
- Performance optimizations
- Documentation improvements
- Bug fixes
This project is licensed under the MIT License - see the LICENSE file for details.
- Databricks for the Asset Bundles framework
- HashiCorp for Terraform
- The data engineering and MLOps community
For questions or feedback:
- Open an issue in this repository
- Follow me on Medium for the full article
Built with ❤️ for the Data Engineering community
Keywords: Databricks, Asset Bundles, DAB, Azure, Terraform, CI/CD, MLOps, DataOps, Infrastructure as Code, GitHub Actions, ETL, Machine Learning