Commit 7f9579f - Update README (#33)
1 parent 9aa2988

1 file changed: README.md (+206 -67)
# What is upstream-prod?

`upstream-prod` is a dbt package for easily using production data in a development environment. It's a hands-off alternative to the [defer flag](https://docs.getdbt.com/reference/node-selection/defer) - only without the need to find and download a production manifest - and was inspired by [similar work by Monzo](https://monzo.com/blog/2021/10/14/an-introduction-to-monzos-data-stack).

### Why do I need it?

In a typical project, prod and dev models are materialised in [separate environments](https://docs.getdbt.com/docs/core/dbt-core-environments). Although this ensures end users are unaffected by ongoing development, there's a significant downside: the isolation means that each environment needs a complete, up-to-date copy of _every_ model. This can be challenging for complex projects or long-running models, and out-of-date data can cause frustrating errors.

`upstream-prod` solves this by intelligently redirecting `ref`s to prod outputs. It is highly adaptable and can be used whether your environments are in separate schemas, databases, or a combination of both. On most warehouses it can even compare dev and prod outputs and use the most recently-updated relation.
## Setup

The package relies on a few variables that indicate where prod data is available. The exact requirements depend on your setup; use the questions below to find the correct variables for your project.
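If you haven't installed the package yet, a `packages.yml` entry along these lines should work (a sketch - the pinned revision is illustrative, so use the latest release), followed by `dbt deps`:

```yml
# packages.yml - illustrative pin; swap in the latest release
packages:
  - git: https://github.com/LewisDavies/upstream-prod
    revision: 0.5.0
```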
#### 1. Does your project have a custom schema macro?

If you aren't sure, check your `macros` directory for a macro called `generate_schema_name`. The exact filename may differ - [dbt's docs](https://docs.getdbt.com/docs/build/custom-schemas#a-built-in-alternative-pattern-for-generating-schema-names) call it `get_custom_schema.sql` - so you may need to check the file contents.
#### 2. Do your dev & prod environments use the same database?

Your platform may use a different term, such as _catalog_ on Databricks or _project_ on BigQuery.
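If you aren't sure, compare the `database` value of each target in `profiles.yml`. Here is a sketch of a same-database setup (the profile name, database and schemas are hypothetical):

```yml
# profiles.yml - hypothetical same-database setup
jaffle_shop:
  target: dev
  outputs:
    dev:
      database: analytics   # same database as prod...
      schema: dbt_alice     # ...but a per-developer schema
      ...
    prod:
      database: analytics
      schema: prod
      ...
```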
#### 3. Choose the appropriate setup

|                                   | Custom schema macro | No custom schema macro |
|-----------------------------------|---------------------|------------------------|
| Dev & prod in same database       | Setup A             | Setup B                |
| Dev & prod in different databases | Setup C             | Setup D                |
<!-- START COLLAPSIBLE SECTIONS -->

<!-- A: custom macro & same database -->
<details><summary>Setup A</summary>
<br/>

The custom macro requires two small tweaks to work with the package. This is easiest to explain with an example, so here is how to modify the [built-in `generate_schema_name_for_env` macro](https://github.com/dbt-labs/dbt-adapters/blob/6e765f58d1a15f7fcc15e504916543bd55bd62b7/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47-L60).

```sql
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}
    {%- set default_schema = target.schema -%}
    -- 2. In the clause that generates your prod schema names, add a check that the value is True
    --    (make sure to enclose the or condition in brackets)
    {%- if (target.name == "prod" or is_upstream_prod == true) and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```
<br/>

Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs.**

```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_env_schemas: true
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```

Since this setup uses a custom schema macro, `upstream_prod_env_schemas: true` tells the package to call your `generate_schema_name` macro with `is_upstream_prod=True` when locating prod relations.
</details>
<!-- B: no custom macro & same database -->
<details><summary>Setup B</summary>
<br/>

Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs.**

```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_schema: <prod_schema_name/prefix>
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```

For example, if prod relations live in schemas like `prod` and `prod_stg` while dev relations are built into `dbt_<name>` and `dbt_<name>_stg`, you would set `upstream_prod_schema: prod` (the schema names here are illustrative).
</details>

<!-- C: custom macro & different databases -->
<details><summary>Setup C</summary>
<br/>

The custom macro requires two small tweaks to work with the package. This is easiest to explain with an example, so here is how to modify the [built-in `generate_schema_name_for_env` macro](https://github.com/dbt-labs/dbt-adapters/blob/6e765f58d1a15f7fcc15e504916543bd55bd62b7/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47-L60).

```sql
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}
    {%- set default_schema = target.schema -%}
    -- 2. In the clause that generates your prod schema names, add a check that the value is True
    --    (make sure to enclose the or condition in brackets)
    {%- if (target.name == "prod" or is_upstream_prod == true) and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```
<br/>

Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs.**

```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_database: <prod_database_name>
  upstream_prod_env_schemas: true
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```
<details><summary><b>Advanced: projects with multiple prod & dev databases</b></summary>
<br/>

If your project materialises models in more than one database per environment, use `upstream_prod_database_replace` instead of `upstream_prod_database`. Provide a two-item list containing the value to find in your dev database names and the string to replace it with.

For example, a project that materialises `models/marts` in one database and everything else in another would use 4 databases:
- During development
  - `models/marts` &rarr; `dev_marts_db`
  - Everything else &rarr; `dev_stg_db`
- In production
  - `models/marts` &rarr; `prod_marts_db`
  - Everything else &rarr; `prod_stg_db`

Setting `upstream_prod_database_replace: [dev, prod]` would allow the package to work with this project.
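In `dbt_project.yml` this looks like the sketch below (the `dev`/`prod` values come from the example database names above):

```yml
# dbt_project.yml
vars:
  # Replace "dev" with "prod" in dev database names to find the prod relations
  upstream_prod_database_replace: [dev, prod]
```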
</details>

</details>

<!-- D: no custom macro & different databases -->
<details><summary>Setup D</summary>
<br/>

Add the values below to the `vars` section of `dbt_project.yml`. Some optional variables are included to improve your experience:
- `upstream_prod_fallback` tells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.
- `upstream_prod_prefer_recent` compares when the prod and dev relations were last modified and returns the most recent. **This is only available on Snowflake, Databricks & BigQuery.**
- `upstream_prod_disabled_targets` is used to bypass the package in certain environments. **It is highly recommended to disable the package for prod runs.**

```yml
# dbt_project.yml
vars:
  # Required
  upstream_prod_database: <prod_database_name>
  upstream_prod_schema: <prod_schema_name/prefix>
  # Optional, but recommended
  upstream_prod_fallback: true
  upstream_prod_prefer_recent: true
  upstream_prod_disabled_targets:
    - prod
```

For example, if prod builds into database `db` with schemas prefixed `prod`, you would set `upstream_prod_database: db` and `upstream_prod_schema: prod` (the names here are illustrative).

<details><summary><b>Advanced: projects with multiple prod & dev databases</b></summary>
<br/>

If your project materialises models in more than one database per environment, use `upstream_prod_database_replace` instead of `upstream_prod_database`. Provide a two-item list containing the value to find in your dev database names and the string to replace it with.

For example, a project that materialises `models/marts` in one database and everything else in another would use 4 databases:
- During development
  - `models/marts` &rarr; `dev_marts_db`
  - Everything else &rarr; `dev_stg_db`
- In production
  - `models/marts` &rarr; `prod_marts_db`
  - Everything else &rarr; `prod_stg_db`

Setting `upstream_prod_database_replace: [dev, prod]` would allow the package to work with this project.
</details>

</details>
<!-- END COLLAPSIBLE SECTIONS -->

### 4. Create a custom `ref()` macro

In your `macros` directory, create a file called `ref.sql` with the following contents:

```sql
{% macro ref(
    parent_model,
    prod_database=var("upstream_prod_database", None),
    ...
{% endmacro %}
```
Alternatively, you can find any instances of `{{ ref() }}` in your project and replace them with `{{ upstream_prod.ref() }}`. This is suitable for testing the package but is not recommended for general use.
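For example, a model that previously selected from `{{ ref('stg_events') }}` (a hypothetical staging model) would become:

```sql
-- models/int_events.sql
select *
from {{ upstream_prod.ref('stg_events') }}
```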
## How it works

Assume your project has an `events` model that depends on intermediate and staging layers. The simplified DAGs look like this:

```mermaid
graph LR
    source[(Source)]
    source -.-> prod_stg[stg_events]
    source ==> dev_stg[stg_events]

    subgraph prod
        prod_stg -.-> prod_int[int_events] -.-> prod_mart[events]
    end

    subgraph dev
        dev_stg ==> dev_int[int_events] ==> dev_mart[events]
    end
```
You want to change `int_events`, so you need a copy of `stg_events` in dev. This could be expensive and time-consuming to create from scratch, and it could slow down your development process considerably. Perhaps this model already exists from previous work, but is it up-to-date? If the model definition or underlying data has changed, your dev model may break in prod.

`upstream-prod` solves this problem by intelligently redirecting `ref`s based on the selected models for the current run. Running `dbt build -s int_events+` would:

1. Create `dev.int_events` using data from `prod.stg_events`
2. Create `dev.events` on top of `dev.int_events`, since the package recognises that `int_events` has been selected
3. Run tests against `dev.int_events` and `dev.events`
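In other words, the package changes what each `ref` compiles to. A hypothetical compiled result, using the relation names from the steps above:

```sql
-- models/int_events.sql, compiled during `dbt build -s int_events+`
select * from {{ ref('stg_events') }}
-- resolves to the prod relation:
--   select * from prod.stg_events
-- rather than the possibly stale dev copy:
--   select * from dev.stg_events
```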
Now that your dev models are using prod data, your DAG would look like this:

```mermaid
graph LR
    source[(Source)]
    source ==> prod_stg[stg_events]
    source -.-> dev_stg[stg_events]

    subgraph prod
        prod_stg -.-> prod_int[int_events] -.-> prod_mart[events]
    end

    subgraph dev
        dev_stg ~~~ dev_int
        prod_stg ==> dev_int[int_events] ==> dev_mart[events]
    end
```
## Compatibility

`upstream-prod` was initially designed on Snowflake and is now primarily tested on Databricks. Based on my experience and user reports, it is known to work on:
- Snowflake
- Databricks
- BigQuery
- Redshift (you may need [RA3 nodes](https://aws.amazon.com/redshift/features/ra3/) for cross-database queries)
- Azure Synapse

It should also work with community-supported adapters that specify a target database or schema in `profiles.yml`.
