# upstream-prod

`upstream-prod` is a dbt package for easily using production data in a development environment. It's a hands-off alternative to the [defer flag](https://docs.getdbt.com/reference/node-selection/defer) - with no need to find and download a production manifest - and was inspired by [similar work by Monzo](https://monzo.com/blog/2021/10/14/an-introduction-to-monzos-data-stack).
### Why do I need it?
In a typical project, prod and dev models are materialised in [separate environments](https://docs.getdbt.com/docs/core/dbt-core-environments). Although this ensures end users are unaffected by ongoing development, there's a significant downside: the isolation means that each environment needs a complete, up-to-date copy of _every_ model. This can be challenging for complex projects or long-running models, and out-of-date data can cause frustrating errors.
`upstream-prod` solves this by intelligently redirecting `ref`s to prod outputs. It is highly adaptable and can be used whether your environments are in separate schemas, databases, or a combination of both. On most warehouses it can even compare dev and prod outputs and use the most recently-updated relation.
## Setup
The package relies on a few variables that indicate where prod data is available. The exact requirements depend on your setup; use the questions below to find the correct variables for your project.
#### 1. Does your project have a custom schema macro?
If you aren't sure, check your `macros` directory for a macro called `generate_schema_name`. The exact filename may differ - [dbt's docs](https://docs.getdbt.com/docs/build/custom-schemas#a-built-in-alternative-pattern-for-generating-schema-names) call it `get_custom_schema.sql` - so you may need to check the file contents.
#### 2. Do your dev & prod environments use the same database?
Your platform may use a different term, such as _catalog_ on Databricks or _project_ on BigQuery.

| | Custom schema macro | No custom macro |
| --- | --- | --- |
| Dev & prod in same database | Setup A | Setup B |
| Dev & prod in different databases | Setup C | Setup D |
<!-- START COLLAPSIBLE SECTIONS -->
<!-- A: custom macro & same database -->
<details><summary>Setup A</summary>
<br/>
The custom macro requires two small tweaks to work with the package. This is easiest to explain with an example, so here is how to modify the [built-in `generate_schema_name_for_env` macro](https://github.com/dbt-labs/dbt-adapters/blob/6e765f58d1a15f7fcc15e504916543bd55bd62b7/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47-L60).
```sql
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name_for_env(custom_schema_name, node, is_upstream_prod=False) -%}
    {%- set default_schema = target.schema -%}
    -- 2. In the clause that generates your prod schema names, add a check that the value is True
    -- **Make sure to enclose the or condition in brackets**
    {%- if (target.name == "prod" or is_upstream_prod == true) and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```
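To sanity-check the branching, the same logic can be sketched in plain Python (a hypothetical helper for illustration only, not part of the package or dbt):

```python
def generate_schema_name(custom_schema_name, target_name, default_schema,
                         is_upstream_prod=False):
    # Prod runs - and upstream-prod lookups - use the custom schema name;
    # everything else falls back to the target's default schema.
    if (target_name == "prod" or is_upstream_prod) and custom_schema_name is not None:
        return custom_schema_name.strip()
    return default_schema
```

A dev run keeps writing to your own schema, while the package can still resolve upstream models to their prod schema names by passing `is_upstream_prod=True`.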
<br/>
Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs**.
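Taken together, the options roughly follow this decision order (a Python sketch under assumed semantics, not the package's actual implementation):

```python
from datetime import datetime

def resolve_relation(prod_relation, dev_relation, prod_updated=None, dev_updated=None,
                     fallback=True, prefer_recent=True):
    # prod_relation / dev_relation are None when that relation doesn't exist yet
    if prod_relation and dev_relation and prefer_recent and prod_updated and dev_updated:
        # Return whichever relation was modified most recently
        return prod_relation if prod_updated >= dev_updated else dev_relation
    if prod_relation:
        return prod_relation
    if fallback and dev_relation:
        # Prod version missing, e.g. a brand-new model - fall back to dev
        return dev_relation
    raise LookupError("relation not found in prod and fallback is disabled")
```

With `prefer_recent` off, prod always wins when it exists; with `fallback` off, a missing prod relation is an error.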
```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_env_schemas: true
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```
</details>
<!-- B: no custom macro & same database -->
<details><summary>Setup B</summary>
<br/>
Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs**.

```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_schema: <prod_schema_name/prefix>
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```

</details>

<!-- C: custom macro & different databases -->
<details><summary>Setup C</summary>
<br/>

The custom macro requires two small tweaks to work with the package. This is easiest to explain with an example, so here is how to modify the [built-in `generate_schema_name_for_env` macro](https://github.com/dbt-labs/dbt-adapters/blob/6e765f58d1a15f7fcc15e504916543bd55bd62b7/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47-L60).
```sql
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name_for_env(custom_schema_name, node, is_upstream_prod=False) -%}
    {%- set default_schema = target.schema -%}
    -- 2. In the clause that generates your prod schema names, add a check that the value is True
    -- **Make sure to enclose the or condition in brackets**
    {%- if (target.name == "prod" or is_upstream_prod == true) and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```
<br/>
Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs**.
```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_database: <prod_database_name>
  upstream_prod_env_schemas: true
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```
<details><summary><b>Advanced: projects with multiple prod & dev databases</b></summary>
<br/>
If your project materialises models in more than one database per env, use `upstream_prod_database_replace` instead of `upstream_prod_database`. You can then provide a two-item list with the value to find and its replacement string.
For example, a project that materialises `models/marts` in one database and everything else in another would use 4 databases:

- During development
  - `models/marts` → `dev_marts_db`
  - Everything else → `dev_stg_db`
- In production
  - `models/marts` → `prod_marts_db`
  - Everything else → `prod_stg_db`
Setting `upstream_prod_database_replace: [dev, prod]` would allow the package to work with this project.
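The find-and-replace behaviour can be pictured with a tiny sketch (assumed semantics - a simple substring substitution on the database name):

```python
def replace_database(dev_database, database_replace):
    # database_replace is the two-item [find, replace] list from dbt_project.yml
    find, replacement = database_replace
    return dev_database.replace(find, replacement)
```

With `upstream_prod_database_replace: [dev, prod]`, `dev_marts_db` resolves to `prod_marts_db` and `dev_stg_db` to `prod_stg_db`.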
</details>
</details>
<!-- D: no custom macro & different databases -->
<details><summary>Setup D</summary>
<br/>
Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs**.
```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_database: <prod_database_name>
  upstream_prod_schema: <prod_schema_name/prefix>
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```
<details><summary><b>Advanced: projects with multiple prod & dev databases</b></summary>
<br/>
If your project materialises models in more than one database per env, use `upstream_prod_database_replace` instead of `upstream_prod_database`. You can then provide a two-item list with the value to find and its replacement string.
For example, a project that materialises `models/marts` in one database and everything else in another would use 4 databases:

- During development
  - `models/marts` → `dev_marts_db`
  - Everything else → `dev_stg_db`
- In production
  - `models/marts` → `prod_marts_db`
  - Everything else → `prod_stg_db`
Setting `upstream_prod_database_replace: [dev, prod]` would allow the package to work with this project.
</details>
</details>
<!-- END COLLAPSIBLE SECTIONS -->
### 4. Create a custom `ref()` macro
In your `macros` directory, create a file called `ref.sql` with the following contents:
```sql
{% macro ref(parent_model) %}
    {% do return(upstream_prod.ref(parent_model)) %}
{% endmacro %}
```
Alternatively, you can find any instances of `{{ ref() }}` in your project and replace them with `{{ upstream_prod.ref() }}`. This is suitable for testing the package but is not recommended for general use.
## How it works
Assume your project has an `events` model that depends on intermediate and staging layers, with a simplified DAG of `stg_events -> int_events -> events`.
You want to change `int_events`, so you need a copy of `stg_events` in dev. This could be expensive and time-consuming to create from scratch, and it could slow down your development process considerably. Perhaps this model already exists from previous work, but is it up-to-date? If the model definition or underlying data has changed, your dev model may break in prod.
`upstream-prod` solves this problem by intelligently redirecting `ref`s based on the selected models for the current run. Running `dbt build -s int_events+` would:
1. Create `dev.int_events` using data from `prod.stg_events`
2. Create `dev.events` on top of `dev.int_events`, since the package recognises that `int_events` has been selected
3. Run tests against `dev.int_events` and `dev.events`
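The selection-aware redirect can be sketched as follows (hypothetical pseudologic - the package's real implementation inspects the dbt graph):

```python
def redirect_ref(model, selected, prod_has_model=True, fallback=True):
    # Models selected for this run are built in dev, so refs to them stay in dev;
    # unselected upstream models are read from prod instead.
    if model in selected:
        return f"dev.{model}"
    if prod_has_model:
        return f"prod.{model}"
    if fallback:
        return f"dev.{model}"
    raise LookupError(f"{model} not found in prod")
```

For the run above, `stg_events` is unselected and resolves to prod, while `int_events` and `events` resolve to dev.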
Now that your dev models are using prod data, your DAG looks like `prod.stg_events -> dev.int_events -> dev.events`.
## Compatibility

`upstream-prod` was initially designed on Snowflake and is now primarily tested on Databricks. Based on my experience and user reports, it is known to work on:
- Snowflake
- Databricks
- BigQuery
- Redshift (you may need [RA3 nodes](https://aws.amazon.com/redshift/features/ra3/) for cross-database queries)
- Azure Synapse
It should also work with community-supported adapters that specify a target database or schema in `profiles.yml`.