Snowplow Databricks Loader

This project contains applications required to load Snowplow data into Databricks with low latency.

Check out the example config files for how to configure your loader.

Step 1: Run the loader

The Databricks loader reads the stream of enriched events and pushes staging files to a Databricks volume.

Basic usage:

docker run \
  -v /path/to/config.hocon:/var/config.hocon \
  -v /path/to/iglu.hocon:/var/iglu.hocon \
  snowplow/databricks-loader-<flavour>:0.2.0 \
  --config=/var/config.hocon \
  --iglu-config=/var/iglu.hocon

...where <flavour> is either kinesis (for AWS), pubsub (for GCP) or kafka (for Azure).
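
As a minimal sketch of the flavour mapping above, the image name can be derived from the cloud you deploy to. The helper name `loader_image` and the short cloud labels (`aws`, `gcp`, `azure`) are hypothetical conveniences, not part of the loader; the flavours and the `0.2.0` tag are the ones given in this README.

```shell
# Hypothetical helper (not shipped with the loader): map a cloud label to the
# image flavour named in this README and print the full image reference.
# Usage: loader_image <aws|gcp|azure> [version]
loader_image() {
  case "$1" in
    aws)   flavour=kinesis ;;   # AWS   -> kinesis
    gcp)   flavour=pubsub ;;    # GCP   -> pubsub
    azure) flavour=kafka ;;     # Azure -> kafka
    *)     echo "unknown cloud: $1" >&2; return 1 ;;
  esac
  # The version defaults to the tag used in the basic usage example.
  echo "snowplow/databricks-loader-${flavour}:${2:-0.2.0}"
}
```

For example, `loader_image gcp` prints `snowplow/databricks-loader-pubsub:0.2.0`, which can be substituted for the image argument in the `docker run` command.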

Step 2: Run a Databricks Lakeflow Declarative Pipeline

Create a Pipeline in your Databricks workspace and copy the following SQL into the associated .sql file:

CREATE STREAMING LIVE TABLE events
CLUSTER BY (load_tstamp, event_name)
TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' =
      'load_tstamp,collector_tstamp,derived_tstamp,dvce_created_tstamp,true_tstamp,event_name'
)
AS SELECT
  *,
  current_timestamp() as load_tstamp
FROM cloud_files(
  "/Volumes/<CATALOG_NAME>/<SCHEMA_NAME>/<VOLUME_NAME>/events",
  "parquet",
  map(
    "cloudfiles.inferColumnTypes", "false",
    "cloudfiles.includeExistingFiles", "false", -- set to true to load files already present in the volume
    "cloudfiles.schemaEvolutionMode", "addNewColumns",
    "cloudfiles.partitionColumns", "",
    "cloudfiles.useManagedFileEvents", "true",
    "datetimeRebaseMode", "CORRECTED",
    "int96RebaseMode", "CORRECTED",
    "mergeSchema", "true"
  )
)

Replace the placeholder path passed to cloud_files with the path to your volume. Unity Catalog volume paths take the form /Volumes/<catalog>/<schema>/<volume>/<path>.
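
If you keep the pipeline SQL as a template file, the placeholder substitution can be scripted. The following is a rough sketch only: the function name `render_pipeline_sql` and the file names in the usage comment are hypothetical, and it assumes the catalog, schema and volume names contain no `/` or `&` characters, which `sed` would otherwise treat specially.

```shell
# Hypothetical helper: fill the <CATALOG_NAME>, <SCHEMA_NAME> and <VOLUME_NAME>
# placeholders in pipeline SQL read from stdin and write the result to stdout.
# Usage: render_pipeline_sql <catalog> <schema> <volume> < template.sql > pipeline.sql
render_pipeline_sql() {
  # Each placeholder is replaced wherever it appears, so the order of the
  # placeholders in the template does not matter.
  sed -e "s/<CATALOG_NAME>/$1/g" \
      -e "s/<SCHEMA_NAME>/$2/g" \
      -e "s/<VOLUME_NAME>/$3/g"
}
```

The rendered SQL can then be pasted into the Pipeline's .sql file as described above.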

Find out more

Technical Docs | Setup Guide | Roadmap & Contributing

Copyright and License

Copyright (c) 2012-present Snowplow Analytics Ltd. All rights reserved.

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)
