upstream-prod is a dbt package for easily using production data in a development environment. It's a hands-off alternative to the defer flag - only without the need to find and download a production manifest - and was inspired by similar work by Monzo.
In a typical project, prod and dev models are materialised in separate environments. Although this ensures end users are unaffected by ongoing development, there's a significant downside: the isolation means that each environment needs a complete, up-to-date copy of every model. This can be challenging for complex projects or long-running models, and out-of-date data can cause frustrating errors.
upstream-prod solves this by intelligently redirecting refs to prod outputs. It is highly adaptable and can be used whether your environments are in separate schemas, databases, or a combination of both. On most warehouses it can even compare dev and prod outputs and use the most recently-updated relation.
microbatch on Snowflake with dbt Fusion is supported but may be fragile. Fusion's microbatch handling is still a work in progress. It doesn't apply batch filters to refs the way dbt-core does (dbt-labs/dbt-fusion#1608), so upstream-prod reconstructs the filter manually from model.batch and the parent's event_time. This works on Snowflake Fusion today, but it leans on Fusion-internal state with no stability guarantees and could break in future Fusion releases.
Don't use microbatch on Databricks / BigQuery Fusion. Builds crash inside the adapter's incremental materialization (adapter.clean_sql(model['compiled_code']) resolves to undefined during the per-batch render). This reproduces with a vanilla microbatch model — no upstream-prod involvement — so it's an upstream Fusion gap rather than something upstream-prod can patch around. Exclude microbatch from your Fusion runs:
dbt run --exclude config.incremental_strategy:microbatchMicrobatch with upstream-prod works fully on dbt-core for every supported adapter.
ℹ️ If you need help setting up the package, please create an issue or tag / DM @LewisDavies on the dbt Slack.
The package relies on a few variables that indicate where prod data is available. The exact requirements depend on your setup; use the questions below to find the correct variables for your project.
Add upstream_prod to packages.yml, then run dbt deps:
# packages.yml
packages:
- package: LewisDavies/upstream_prod
version: 0.10.3If you aren't sure, check your macros directory for a macro called generate_schema_name. The exact filename may differ - dbt's docs call it get_custom_schema.sql - so you may need to check the file contents.
Your platform may use a different term, such as catalog on Databricks or project on BigQuery.
| Custom schema macro | No custom schema macro | |
|---|---|---|
| Dev & prod in same database | Setup A | Setup B |
| Dev & prod in different databases | Setup C | Setup D |
Setup A
Your custom schema macro needs two changes:
- Add an
is_upstream_prod=Falseparameter to the macro signature. - In every condition that selects your production schema naming path, add
or is_upstream_prod is true— and make sure the resultingorgroup is wrapped in brackets.
For example, the macro below shows how to adapt the built-in generate_schema_name_for_env macro:
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}
{%- set default_schema = target.schema -%}
-- 2. In the clause that generates your prod schema names, add a check that the value is True
-- **Make sure to enclose the or condition in brackets**
{%- if (target.name == "prod" or is_upstream_prod is true) and custom_schema_name is not none -%}
{{ custom_schema_name | trim }}
{%- else -%}
{{ default_schema }}
{%- endif -%}
{%- endmacro %}Add the values below to the vars section of dbt_project.yml. Some optional variables are included to improve your experience:
upstream_prod_fallbacktells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.upstream_prod_prefer_recentcompares when the prod and dev relations were last modified and returns the most recent. This is only available on Snowflake, Databricks & BigQuery.upstream_prod_disabled_targetsis used to bypass the package in certain environments. It is highly recommended to disable the package for prod runs.
# dbt_project.yml
vars:
# Required
upstream_prod_env_schemas: true
# Optional, but recommended
upstream_prod_fallback: true
upstream_prod_prefer_recent: true
upstream_prod_disabled_targets:
- prodSetup B
Add the values below to the vars section of dbt_project.yml. Some optional variables are included to improve your experience:
upstream_prod_fallbacktells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.upstream_prod_prefer_recentcompares when the prod and dev relations were last modified and returns the most recent. This is only available on Snowflake, Databricks & BigQuery.upstream_prod_disabled_targetsis used to bypass the package in certain environments. It is highly recommended to disable the package for prod runs.
# dbt_project.yml
vars:
# Required
upstream_prod_schema: <prod_schema_name/prefix>
# Optional, but recommended
upstream_prod_fallback: true
upstream_prod_prefer_recent: true
upstream_prod_disabled_targets:
- prodSetup C
Your custom schema macro needs two changes:
- Add an
is_upstream_prod=Falseparameter to the macro signature. - In every condition that selects your production schema naming path, add
or is_upstream_prod is true— and make sure the resultingorgroup is wrapped in brackets.
For example, the macro below shows how to adapt the built-in generate_schema_name_for_env macro:
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}
{%- set default_schema = target.schema -%}
-- 2. In the clause that generates your prod schema names, add a check that the value is True
-- **Make sure to enclose the or condition in brackets**
{%- if (target.name == "prod" or is_upstream_prod is true) and custom_schema_name is not none -%}
{{ custom_schema_name | trim }}
{%- else -%}
{{ default_schema }}
{%- endif -%}
{%- endmacro %}Add the values below to the vars section of dbt_project.yml. Some optional variables are included to improve your experience:
upstream_prod_fallbacktells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.upstream_prod_prefer_recentcompares when the prod and dev relations were last modified and returns the most recent. This is only available on Snowflake, Databricks & BigQuery.upstream_prod_disabled_targetsis used to bypass the package in certain environments. It is highly recommended to disable the package for prod runs.
# dbt_project.yml
vars:
# Required
upstream_prod_database: <prod_database_name>
upstream_prod_env_schemas: true
# Optional, but recommended
upstream_prod_fallback: true
upstream_prod_prefer_recent: true
upstream_prod_disabled_targets:
- prodUsing a custom database macro?
There are two more steps if your project has a custom generate_database_name macro.
First, add upstream_prod_env_dbs: true to dbt_project.yml.
Then apply the same two changes to your custom database macro:
- Add an
is_upstream_prod=Falseparameter to the macro signature. - In every condition that selects your production database naming path, add
or is_upstream_prod is true— and make sure the resultingorgroup is wrapped in brackets.
For example:
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_database_name(custom_database_name=none, node=none, is_upstream_prod=False) -%}
{%- set default_database = target.database -%}
-- 2. In the clause that generates your prod database names, add a check that the value is True
-- **Make sure to enclose the or condition in brackets**
{%- if (target.name == "prod" or is_upstream_prod is true) and custom_database_name is not none -%}
{{ custom_database_name | trim }}
{%- else -%}
{{ default_database }}
{%- endif -%}
{%- endmacro %}Setup D
Add the values below to the vars section of dbt_project.yml. Some optional variables are included to improve your experience:
upstream_prod_fallbacktells the package to return your dev relation if the prod version can't be found. This is very useful when creating multiple models at the same time.upstream_prod_prefer_recentcompares when the prod and dev relations were last modified and returns the most recent. This is only available on Snowflake, Databricks & BigQuery.upstream_prod_disabled_targetsis used to bypass the package in certain environments. It is highly recommended to disable the package for prod runs.
# dbt_project.yml
vars:
# Required
upstream_prod_database: <prod_database_name>
upstream_prod_schema: <prod_schema_name/prefix>
# Optional, but recommended
upstream_prod_fallback: true
upstream_prod_prefer_recent: true
upstream_prod_disabled_targets:
- prodUsing a custom database macro?
There are two more steps if your project has a custom generate_database_name macro.
First, add upstream_prod_env_dbs: true to dbt_project.yml.
Your custom database macro needs two changes:
- Add an
is_upstream_prod=Falseparameter to the macro signature. - In every condition that selects your production database naming path, add
or is_upstream_prod is true— and make sure the resultingorgroup is wrapped in brackets.
For example:
-- 1. Add an is_upstream_prod parameter that defaults to False
{% macro generate_database_name(custom_database_name=none, node=none, is_upstream_prod=False) -%}
{%- set default_database = target.database -%}
-- 2. In the clause that generates your prod database names, add a check that the value is True
-- **Make sure to enclose the or condition in brackets**
{%- if (target.name == "prod" or is_upstream_prod is true) and custom_database_name is not none -%}
{{ custom_database_name | trim }}
{%- else -%}
{{ default_database }}
{%- endif -%}
{%- endmacro %}In your macros directory, create a file called ref.sql with the following contents:
{% macro ref(
parent_arg_1,
parent_arg_2=None,
prod_database=var("upstream_prod_database", None),
prod_schema=var("upstream_prod_schema", None),
enabled=var("upstream_prod_enabled", True),
fallback=var("upstream_prod_fallback", False),
env_schemas=var("upstream_prod_env_schemas", False),
version=None,
prefer_recent=var("upstream_prod_prefer_recent", False),
env_dbs=var("upstream_prod_env_dbs", False)
) %}
{% do return(upstream_prod.ref(
parent_arg_1,
parent_arg_2,
prod_database,
prod_schema,
enabled,
fallback,
env_schemas,
version,
prefer_recent,
env_dbs
)) %}
{% endmacro %}Alternatively, you can find any instances of {{ ref() }} in your project and replace them with {{ upstream_prod.ref() }}. This is suitable for testing the package but is not recommended for general use.
If you use the dbt Power User VS Code extension, set its dbt integration mode to corecommand:
// .vscode/settings.json
"dbt.dbtIntegration": "corecommand"The default core mode uses a custom dbt runner that doesn't play nicely with the upstream_prod_prefer_recent setting. Using the corecommand mode makes the extension use CLI commands instead, meaning this package always compares prod & dev relations when running preview queries.
Run dbt compile -s <any_model_with_refs> and open the compiled SQL in target/compiled/…. Refs to models you haven't built locally should resolve to your production schema; refs to models you have built should resolve to your dev schema. If refs still point at empty dev schemas, check that your ref.sql macro is in macros/ and that upstream_prod_enabled isn't set to false.
Assume your project has an events model that depends on intermediate and staging layers. The simplified DAGs looks like this:
graph LR
source[(Source)]
source -.-> prod_stg[stg_events]
source ==> dev_stg[stg_events]
subgraph prod
prod_stg -.-> prod_int[int_events] -.-> prod_mart[events]
end
subgraph dev
dev_stg ==> dev_int[int_events] ==> dev_mart[events]
end
You want to change int_events, so you need a copy of stg_events in dev. This could be expensive and time-consuming to create from scratch, and it could slow down your development process considerably. Perhaps this model already exists from previous work, but is it up-to-date? If the model definition or underlying data has changed, your dev model may break in prod.
upstream-prod solves this problem by intelligently redirecting refs based on the selected models for the current run. Running dbt build -s int_events+ would:
- Create
dev.int_eventsusing data fromprod.stg_events - Create
dev.eventson top ofdev.int_events, since the package recognises thatint_eventshas been selected - Run tests against
dev.int_eventsanddev.events
Now that your dev models are using prod data, your DAG would look like this:
graph LR
source[(Source)]
source ==> prod_stg[stg_events]
source -.-> dev_stg[stg_events]
subgraph prod
prod_stg -.-> prod_int[int_events] -.-> prod_mart[events]
end
subgraph dev
dev_stg ~~~ dev_int
prod_stg ==> dev_int[int_events] ==> dev_mart[events]
end
upstream-prod is built and tested on Snowflake. Based on my experience and user reports, it is known to work on:
- Snowflake
- Databricks
- BigQuery
- Redshift (you may need RA3 nodes for cross-database queries)
- Azure Synapse
It should also work with community-supported adapters that specify a target database or schema in profiles.yml.