This project contains applications required to load Snowplow data into Databricks with low latency.
Check out the example config files for how to configure your loader.
The Databricks loader reads the stream of enriched events and pushes staging files to a Databricks volume.
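The loader is configured with a HOCON file. As a rough illustration only (the key names below are assumptions, not the loader's actual configuration schema; the example config files are the authoritative reference), a kinesis-flavoured config broadly separates the input stream from the Databricks output:

```hocon
# Illustrative sketch only: these key names are assumptions, not the loader's
# actual configuration schema. Start from one of the example config files and
# adapt it instead.
{
  # Where enriched events are read from (kinesis flavour assumed here)
  "input": {
    "streamName": "enriched"
  }

  # Where staging files are written: a Unity Catalog volume in your workspace
  "output": {
    "catalog": "<CATALOG_NAME>"
    "schema": "<SCHEMA_NAME>"
    "volume": "<VOLUME_NAME>"
  }
}
```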
Basic usage:

```bash
docker run \
  -v /path/to/config.hocon:/var/config.hocon \
  -v /path/to/iglu.hocon:/var/iglu.hocon \
  snowplow/databricks-loader-<flavour>:0.2.0 \
  --config=/var/config.hocon \
  --iglu-config=/var/iglu.hocon
```
...where `<flavour>` is either `kinesis` (for AWS), `pubsub` (for GCP), or `kafka` (for Azure).
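For example, to run the kinesis flavour on AWS (the local file paths are placeholders for your own files):

```bash
docker run \
  -v $PWD/config.hocon:/var/config.hocon \
  -v $PWD/iglu.hocon:/var/iglu.hocon \
  snowplow/databricks-loader-kinesis:0.2.0 \
  --config=/var/config.hocon \
  --iglu-config=/var/iglu.hocon
```

The file passed via `--iglu-config` is an Iglu resolver configuration, which tells the loader where to fetch schemas from. A minimal sketch, assuming the standard resolver format with Iglu Central as the only registry (HOCON accepts this JSON form as-is):

```hocon
# Standard Iglu resolver configuration with only the public Iglu Central
# registry; add your own registries to the "repositories" list as needed.
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [],
        "connection": {
          "http": { "uri": "http://iglucentral.com" }
        }
      }
    ]
  }
}
```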
Create a Pipeline in your Databricks workspace and copy the following SQL into the associated `.sql` file:
```sql
CREATE STREAMING LIVE TABLE events
CLUSTER BY (load_tstamp, event_name)
TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' =
    'load_tstamp,collector_tstamp,derived_tstamp,dvce_created_tstamp,true_tstamp,event_name'
)
AS SELECT
  *,
  current_timestamp() AS load_tstamp
FROM cloud_files(
  "/Volumes/<CATALOG_NAME>/<SCHEMA_NAME>/<VOLUME_NAME>/events",
  "parquet",
  map(
    "cloudfiles.inferColumnTypes", "false",
    "cloudfiles.includeExistingFiles", "false", -- set to true to load files already present in the volume
    "cloudfiles.schemaEvolutionMode", "addNewColumns",
    "cloudfiles.partitionColumns", "",
    "cloudfiles.useManagedFileEvents", "true",
    "datetimeRebaseMode", "CORRECTED",
    "int96RebaseMode", "CORRECTED",
    "mergeSchema", "true"
  )
)
```
Replace `/Volumes/<CATALOG_NAME>/<SCHEMA_NAME>/<VOLUME_NAME>/events` with the correct path to your volume.
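Once the pipeline is running, the `events` streaming table can be queried like any other table. A quick sanity check, assuming the pipeline publishes the table to `<CATALOG_NAME>.<SCHEMA_NAME>` (adjust the fully qualified name to wherever your pipeline writes):

```sql
-- Count recently loaded events by event name.
-- Replace <CATALOG_NAME>.<SCHEMA_NAME> with the catalog and schema your
-- pipeline publishes the events table to.
SELECT
  event_name,
  count(*) AS event_count
FROM <CATALOG_NAME>.<SCHEMA_NAME>.events
WHERE load_tstamp > current_timestamp() - INTERVAL 1 HOUR
GROUP BY event_name
ORDER BY event_count DESC;
```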
| Technical Docs | Setup Guide | Roadmap & Contributing |
|:---:|:---:|:---:|
| Technical Docs | Setup Guide | Roadmap |
Copyright (c) 2012-present Snowplow Analytics Ltd. All rights reserved.
Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)