Feature union schemas (#55)
* Initial changes to support union_schemas

* File rename and additional tweaks

* Changelog and version updates

* Remove testing vars

* Stage

* Redshift

* change schema

* docs and source stuff

* docs

* setup union run

* union

* use ref instead of source

* remove schema cleanup

* try with has_defined_sources = true

* tests passed

* dynamically turn off tmp model

* docs

* modelcount

* get ready for release

* Apply suggestions from code review

Co-authored-by: Avinash Kunnath <[email protected]>

* avinash feedback

---------

Co-authored-by: Brian Toole <[email protected]>
Co-authored-by: Avinash Kunnath <[email protected]>
3 people authored Jan 22, 2025
1 parent 6252a92 commit 601066e
Showing 26 changed files with 835 additions and 100 deletions.
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,17 @@
# dbt_mixpanel version.version
# dbt_mixpanel v0.11.0
[PR #53](https://github.com/fivetran/dbt_mixpanel/pull/53) and [PR #55](https://github.com/fivetran/dbt_mixpanel/pull/55) include the following updates:

## Feature Update: Run Package on Unioned Connections
- This release supports running the package on multiple Mixpanel sources at once! See the [README](https://github.com/fivetran/dbt_mixpanel?tab=readme-ov-file#step-3-define-database-and-schema-variables) for details on how to leverage this feature.
- This was achieved through the introduction of new unioning [macros](https://github.com/fivetran/dbt_mixpanel/tree/main/macros/union).

> Please note: This is a **Breaking Change** in that we have added a new field, `source_relation`, that points to the source connection from which the record originated.
> This `source_relation` field is now part of all generated unique keys.
>
> This will **require running a full refresh**.
## Documentation
- Provided missing column yml documentation.
- Added Quickstart model counts to README. ([#56](https://github.com/fivetran/dbt_mixpanel/pull/56))
- Corrected references to connectors and connections in the README. ([#56](https://github.com/fivetran/dbt_mixpanel/pull/56))

71 changes: 68 additions & 3 deletions README.md
@@ -67,17 +67,19 @@ For **BigQuery** and **Databricks All Purpose Cluster runtime** destinations, we
For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy.

> Regardless of strategy, we recommend that users periodically run a `--full-refresh` to ensure a high level of data quality.

### Step 2: Install the package
Include the following mixpanel package version in your `packages.yml` file:
> TIP: Check [dbt Hub](https://hub.getdbt.com/) for the latest installation instructions or [read the dbt docs](https://docs.getdbt.com/docs/package-management) for more information on installing packages.

```yaml
packages:
  - package: fivetran/mixpanel
    version: [">=0.10.0", "<0.11.0"] # we recommend using ranges to capture non-breaking changes automatically
    version: [">=0.11.0", "<0.12.0"] # we recommend using ranges to capture non-breaking changes automatically
```

### Step 3: Define database and schema variables
#### Option A: Single connection
By default, this package runs using your destination and the `mixpanel` schema. If this is not where your Mixpanel data is (for example, if your Mixpanel schema is named `mixpanel_fivetran`), add the following configuration to your root `dbt_project.yml` file:

```yml
@@ -86,6 +88,69 @@ vars:
  mixpanel_schema: your_schema_name
```

#### Option B: Union multiple connections
If you have multiple Mixpanel connections in Fivetran and would like to use this package on all of them simultaneously, we have provided functionality to do so. For each source table, the package will union all of the data together and pass the unioned table into the transformations. The `source_relation` column in each model indicates the origin of each record.

To use this functionality, you will need to set the `mixpanel_sources` variable in your root `dbt_project.yml` file:

```yml
# dbt_project.yml
vars:
  mixpanel_sources:
    - database: connection_1_destination_name # Likely Required. Default value = target.database
      schema: connection_1_schema_name # Likely Required. Default value = 'mixpanel'
      name: connection_1_source_name # Required only if following the step in the following subsection
    - database: connection_2_destination_name
      schema: connection_2_schema_name
      name: connection_2_source_name
```

> **Note:** If you choose to make use of this unioning functionality, you will incur an additional model materialized as a `view`, called `stg_mixpanel__event_tmp`. This extra model is necessary for the proper compilation of our connection-unioning macros.

##### Recommended: Incorporate unioned sources into DAG
> *If you are running the package through [Fivetran Transformations for dbt Core™](https://fivetran.com/docs/transformations/dbt#transformationsfordbtcore), the below step is necessary in order to synchronize model runs with your Mixpanel connections. Alternatively, you may choose to run the package through Fivetran [Quickstart](https://fivetran.com/docs/transformations/quickstart), which would create separate sets of models for each Mixpanel source rather than one set of unioned models.*

<details><summary>Expand for details</summary>
<br>

By default, this package defines one single-connection source, called `mixpanel`, which will be disabled if you are unioning multiple connections. This means that your DAG will not include your Mixpanel sources, though the package will run successfully.

To properly incorporate all of your Mixpanel connections into your project's DAG:
1. Define each of your sources in a `.yml` file in your project. Utilize the following template for the `source`-level configurations, and, **most importantly**, copy and paste the table and column-level definitions from the package's `src_mixpanel.yml` [file](https://github.com/fivetran/dbt_mixpanel/blob/main/models/staging/src_mixpanel.yml). This package currently only uses the `EVENT` source table.

```yml
# a .yml file in your root project
version: 2
sources:
  - name: <name> # ex: Should match name in mixpanel_sources
    schema: <schema_name>
    database: <database_name>
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness: # feel free to adjust to your liking
      warn_after: {count: 72, period: hour}
      error_after: {count: 168, period: hour}
    tables:
      - name: event
        description: Table of all events tracked by Mixpanel across web, ios, and android platforms.
        columns: # copy and paste from mixpanel/models/staging/src_mixpanel.yml - see https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/ for how to use &/* anchors to only do so once
```

2. Set the `has_defined_sources` variable (scoped to the `mixpanel` package) to `true`, as follows:
```yml
# dbt_project.yml
vars:
  mixpanel:
    has_defined_sources: true
```
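
The anchor approach mentioned in the template comment in step 1 could look roughly like the sketch below. It is an illustration only, not taken from the package: the source, schema, and database names reuse the placeholders from the `mixpanel_sources` example, and the anchor name `mixpanel_tables` is hypothetical.

```yml
# a .yml file in your root project (illustrative sketch, assumed names)
version: 2
sources:
  - name: connection_1_source_name # should match `name` in mixpanel_sources
    schema: connection_1_schema_name
    database: connection_1_destination_name
    loader: fivetran
    loaded_at_field: _fivetran_synced
    tables: &mixpanel_tables # hypothetical anchor: define the table list once
      - name: event
        description: Table of all events tracked by Mixpanel across web, ios, and android platforms.
        # columns: copy and paste from the package's src_mixpanel.yml

  - name: connection_2_source_name
    schema: connection_2_schema_name
    database: connection_2_destination_name
    loader: fivetran
    loaded_at_field: _fivetran_synced
    tables: *mixpanel_tables # alias: reuses the definitions anchored above
```

With this layout, the table and column definitions are written once and every additional connection only repeats its own `name`, `schema`, and `database`. The anchor and its aliases need to live in the same file, since YAML resolves them per document.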

</details>

### (Optional) Step 4: Additional configurations
<details open><summary>Collapse/expand details</summary>

@@ -224,8 +289,8 @@ models:
+schema: my_new_schema_name # leave blank for just the target_schema
```

#### Change the source table references
If an individual source table has a different name than the package expects, add the table name as it appears in your destination to the respective variable:
#### Change the source table references (only if using a single connection)
If an individual source table has a different name than the package expects, add the table name as it appears in your destination to the respective variable. This is not available when running the package on multiple unioned connections.

> IMPORTANT: See this project's [`dbt_project.yml`](https://github.com/fivetran/dbt_mixpanel/blob/main/dbt_project.yml) variable declarations to see the expected names.

3 changes: 2 additions & 1 deletion dbt_project.yml
@@ -1,6 +1,6 @@
config-version: 2
name: 'mixpanel'
version: '0.10.0'
version: '0.11.0'
require-dbt-version: [">=1.3.0", "<2.0.0"]
models:
mixpanel:
@@ -23,3 +23,4 @@ vars:
# session_event_criteria: # filter to place on events in order to qualify for sessionization
sessionization_trailing_window: 3 # number of hours to look back at for each mixpanel__sessions run. this allows you to sessionize events that arrive late without requiring a full refresh
session_passthrough_columns: [] # choose event columns to pass through to mixpanel__sessions (values taken from first event of session)
mixpanel_sources: []
2 changes: 1 addition & 1 deletion docs/catalog.json

Large diffs are not rendered by default.

253 changes: 214 additions & 39 deletions docs/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/manifest.json

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion docs/run_results.json

This file was deleted.

10 changes: 5 additions & 5 deletions integration_tests/ci/sample.profiles.yml
@@ -16,13 +16,13 @@ integration_tests:
pass: "{{ env_var('CI_REDSHIFT_DBT_PASS') }}"
dbname: "{{ env_var('CI_REDSHIFT_DBT_DBNAME') }}"
port: 5439
schema: mixpanel_integration_tests_2
schema: mixpanel_integration_tests_3
threads: 8
bigquery:
type: bigquery
method: service-account-json
project: 'dbt-package-testing'
schema: mixpanel_integration_tests_2
schema: mixpanel_integration_tests_3
threads: 8
keyfile_json: "{{ env_var('GCLOUD_SERVICE_KEY') | as_native }}"
snowflake:
@@ -33,7 +33,7 @@ integration_tests:
role: "{{ env_var('CI_SNOWFLAKE_DBT_ROLE') }}"
database: "{{ env_var('CI_SNOWFLAKE_DBT_DATABASE') }}"
warehouse: "{{ env_var('CI_SNOWFLAKE_DBT_WAREHOUSE') }}"
schema: mixpanel_integration_tests_2
schema: mixpanel_integration_tests_3
threads: 8
postgres:
type: postgres
@@ -42,13 +42,13 @@ integration_tests:
pass: "{{ env_var('CI_POSTGRES_DBT_PASS') }}"
dbname: "{{ env_var('CI_POSTGRES_DBT_DBNAME') }}"
port: 5432
schema: mixpanel_integration_tests_2
schema: mixpanel_integration_tests_3
threads: 8
databricks:
catalog: "{{ env_var('CI_DATABRICKS_DBT_CATALOG') }}"
host: "{{ env_var('CI_DATABRICKS_DBT_HOST') }}"
http_path: "{{ env_var('CI_DATABRICKS_DBT_HTTP_PATH') }}"
schema: mixpanel_integration_tests_2
schema: mixpanel_integration_tests_3
threads: 8
token: "{{ env_var('CI_DATABRICKS_DBT_TOKEN') }}"
type: databricks
14 changes: 11 additions & 3 deletions integration_tests/dbt_project.yml
@@ -1,5 +1,5 @@
name: 'mixpanel_integration_tests'
version: '0.10.0'
version: '0.11.0'
config-version: 2
profile: 'integration_tests'

@@ -9,10 +9,18 @@ models:
# +schema: "mixpanel_{{ var('directed_schema','dev') }}" ## To be used for validation testing

vars:
mixpanel_schema: mixpanel_integration_tests_2
mixpanel_schema: mixpanel_integration_tests_3

# mixpanel_sources:
# - schema: mixpanel_integration_tests_3
# name: source_3
# - schema: mixpanel_integration_tests_4
# name: source_4

mixpanel:
mixpanel_event_identifier: "event"

# has_defined_sources: true

seeds:
mixpanel_integration_tests:
+column_types: