Skip to content

Using Materializations with Dremio

Ravjot Brar edited this page May 14, 2024 · 3 revisions

Seeds

For an overview of what seeds are, refer to Seeds in the dbt documentation.

Where to Store Them

  • Object storage
  • Scratch storage
Configuration Description
object_storage_source The name of the filesystem in which to store seeds. The dbt alias is datalake.
The default value is $scratch.
  • In Dremio Software, the $scratch location points by default to the /scratch directory under Dremio's configured distributed cache location (paths.dist property in dremio.conf). There are no security protocols or permissions associated with this location -- it is readable and writable by all users.
  • If you are using Dremio Cloud, $scratch always points to the /scratch directory. You can find more information about this directory here.
Non-default values correspond to the name of a source in the Object Storage section of the Datasets page in Dremio: Object Storage section of the Datasets page in Dremio
object_storage_path The path in the filesystem in which to store seeds. The dbt alias is root_path.
The default is the root level of the filesystem, represented as the value no_schema. Nested folders in the path are separated with periods.
Non-default values correspond to the path in this location in the Datasets page in Dremio: Path shown on the Datasets page in Dremio

Naming Seeds

By default, seeds are given the same name as the file used to create them. Use the file configuration to give a seed a non-default name.

Materializations

Tables

A Dremio table (also known as a physical dataset) is a table materialization that is either created directly from a source or is created from a seed.

Where to Create Them

  • Object storage
  • Scratch storage
Configuration Description
object_storage_source The name of the filesystem in which to create tables. The dbt alias is datalake.
The default value is $scratch.
  • In Dremio Software, the $scratch location points by default to the /scratch directory under Dremio's configured distributed cache location (paths.dist property in dremio.conf). There are no security protocols or permissions associated with this location -- it is readable and writable by all users.
  • If you are using Dremio Cloud, $scratch always points to the /scratch directory. You can find more information about this directory here.
Non-default values correspond to the name of a source in the Object Storage section of the Datasets page in Dremio: Object Storage section of the Datasets page in Dremio
object_storage_path The path in the filesystem in which to create tables. The dbt alias is root_path.
The default is the root level of the filesystem, represented as the value no_schema. Nested folders in the path are separated with periods.
Non-default values correspond to the path in this location in the Datasets page in Dremio: Path shown on the Datasets page in Dremio

Naming Tables

By default, tables are given the same name as the model used to create them. Use the file configuration to give a table a non-default name. For example, a model named my_basic_table.sql creates by default a table named my_basic_table. With the file configuration, you could create instead a table named my_differently-named-basic-table.

(Optional) Formatting Tables

The format configuration is an optional configuration that determines the format of a persisted table.

For table materializations created from sources, you can specify the format configuration in a project file or a property file. For table materializations created from seeds, you can specify the format configuration in a project file, a property file or a config block. See Model Confgurations in the dbt documentation.

Possible values for both types of table materializations:

  • iceberg: Formats persisted tables as Apache Iceberg tables.
  • json: Formats persisted tables as JSON files.
  • parquet: Formats persisted tables as Parquet files.
  • text: Formats persisted tables as CSV files.

Additional possible values for table materializations created directly from sources:

  • delta: Formats source tables as Delta Lake tables.
  • excel: Formats source tables as Microsoft Excel files.

Additional Settings for the excel Value

If using the excel format, there are additional settings; however, these do not show in PDS but rather only applied when running the VDS generated from the twin strategy.

Setting Description
extract_header Extract the column names from the first line of the file.
xls If the file is in the xls format instead of the xlsx format. Note: This option only works for xls files.
has_merged_cells Expands cells that have been merged in the Excel sheet.
Additional Settings for the text Value

If using the text format, there are additional settings; however, these do not show in PDS but rather only applied when running the VDS generated from the twin strategy.

Settings Description
field_delimiter Field Delimiter Character
line_delimiter Line Delimiter Character
quote Character used to quote. e.g. if the field delimiter character is , and the quote character is ", using "hello, there" would ensure hello and there would not be separated.
comment Character used for comments.
skip_first_line Skip the first line of the file.
extract_header Extract the column names from the first line of the file.
trim_header Trims column names to a currently unknown number of characters.
auto_generated_column_names Create automatic columns if no column names are specified.

(Optional) Partitioning Tables

The partitioning configuration partitions a table materialization.

Option Value Default Comment
partition_by Names of columns, separated by commas N/A Mutually exclusive with distribute_by
partition_method striped, hash, roundrobin striped Valid only when used with partition_by
distribute_by Names of columns, separated by commas N/A Mutually exclusive with partition_by
localsort_by Names of columns, separated by commas N/A Used for sorting within each partition

Example of Using partition_by

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='striped'
) }}

Example of Using partition_method

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='roundrobin'
) }}

Example of Using distribute_by

{{ config(
    materialized='table', 
    distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
    localsort_by='cd_marital_status'
) }}

Examples of localsort_by

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='hash'
) }}
{{ config(
    materialized='table', 
    distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
    localsort_by='cd_marital_status'
) }}

Incremental Materializations

For an overview of what incremental models are, see Incremental Models in the dbt documentation.

Where to Create Them

  • Object storage
  • Scratch storage
Configuration Description
object_storage_source The name of the filesystem in which to create tables from incremental models. The dbt alias is datalake.
The default value is $scratch.
  • In Dremio Software, the $scratch location points by default to the /scratch directory under Dremio's configured distributed cache location (paths.dist property in dremio.conf). There are no security protocols or permissions associated with this location -- it is readable and writable by all users.
  • If you are using Dremio Cloud, $scratch always points to the /scratch directory. You can find more information about this directory here.
Non-default values correspond to the name of a source in the Object Storage section of the Datasets page in Dremio: Object Storage section of the Datasets page in Dremio
object_storage_path The path in the filesystem in which to create tables from incremental models. The dbt alias is root_path.
The default is the root level of the filesystem, represented as the value no_schema. Nested folders in the path are separated with periods.
Non-default values correspond to the path in this location in the Datasets page in Dremio: Path shown on the Datasets page in Dremio

Naming Incremental Materializations

By default, incremental materializations are given the same name as the model used to create them. Use the file configuration to give an incremental materialization a non-default name. For example, a model named my_basic_incremental.sql creates by default an incremental materialization named my_basic_incremental. With the file configuration, you could create instead an incremental materialization named my_differently-named-basic-incremental.

Specifying How to Increment

The optional incremental_strategy config tells the incremental model how to insert data into tables that are already built from the model.

  • append: Insert new records without updating or overwriting any existing data. This is the option that is set by default.
  • merge: Insert new records and update existing records that match the unique ID provided.

Specifying What to Do When a Schema Changes

The on_schema_change configuration specifies how to update a table built from an incremental model when one or more columns are added to or removed from a schema.

The possible values are:

  • sync_all_columns
  • append_new_columns
  • fail
  • ignore

(Optional) Formatting Incremental Materializations

The format configuration is an optional configuration that determines the format of an incremental materialization. You can specify the format configuration in a project file, a property file or a config block. See Model Confgurations in the dbt documentation.

  • iceberg: Formats an incremental materialization as an Apache Iceberg table. This is the only option and is set by default.

(Optional) Partitioning Incremental Materializations

The partitioning configuration partitions an incremental materialization.

Option Value Default Comment
partition_by Names of columns, separated by commas N/A Mutually exclusive with distribute_by
partition_method striped, hash, roundrobin striped Valid only when used with partition_by
distribute_by Names of columns, separated by commas N/A Mutually exclusive with partition_by
localsort_by Names of columns, separated by commas N/A Used for sorting within each partition

Example of Using partition_by

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='striped'
) }}

Example of Using partition_method

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='roundrobin'
) }}

Example of Using distribute_by

{{ config(
    materialized='table', 
    distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
    localsort_by='cd_marital_status'
) }}

Examples of localsort_by

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='hash'
) }}
{{ config(
    materialized='table', 
    distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
    localsort_by='cd_marital_status'
) }}

Views

A Dremio view (also known as a virtual dataset) is a view materialization that is based on a Dremio table.

Where to Create Them

In Dremio spaces.

Configuration Description
dremio_space The value of the Dremio space in which to create views. The dbt alias is database.
The default value is @<username>, where <username> is the username in Dremio.
Non-default values correspond to the name in this location in the Spaces section of the Datasets page in Dremio: Name of a space in Dremio
dremio_space_folder The folder in the Dremio space in which to create views. The dbt alias is schema.
The default is the root level of the space, and is represented by the value no_schema.
Non-default values correspond to the path in this location in the Datasets page in Dremio: Path within a space in Dremio Nested folders are separated with periods. Do not include the root no_schema in the paths of views.

Paths in space are recursive, like filesystem folders : dbt.internal."my folder's name". Dots are not allowed in folder names. The root level of a space is represented as no_schema. To materialize a model contained in the folder my_dbt_project/models/examples at the root folder of the space mySpace, you would set the examples configuration in my_dbt_project/dbt_project.yml like this:

models:
  my_dbt_project:
    examples:
      +dremio_space: mySpace
      +space_folder: no_schema

Note that you can also use the equivalent dbt terminology i.e. +database: mySpace, +schema: no_schema

Folders that do not exist are created if the user running a model has the appropriate permission. However, the space must already exist in Dremio before the model is run.

Do not include the root folder no_schema in the paths of views.

Dremio accepts almost any string character in the objects' names. Therefore, the adapter double-quotes each part of the space.folder.name path of a view. If you specify nested folders in the path of a view, each folder's name is double-quoted. Example: "space"."folder"."sub-folder"."sub-sub-folder"."identifier"

Naming Views

By default, views are given the same name as the model used to create them. Use the alias configuration to give a view a non-default name. For example, a model named my_basic_view.sql creates by default a view named my_basic_view. With the alias configuration, you could create instead a view named my_differently-named-basic-view.

Reflections

A reflection is a materialization that corresponds to a reflection in Dremio.

Where to Create Them

You do not specify where to create reflections. Reflections are automatically created in Dremio's reflection store.

Configuring Reflections

Add the following to bottom of your dbt_project.yml file:

vars:
      dremio:reflections_enabled: true

The table of configuration options below can be set under vars in your dbt_project.yml file. They can also be set in the reflection model’s sql file. In these two examples, my_anchor is a placeholder for the name of the dataset that the reflection is built from.

Example 1

{{ config(materialized='reflection', 
          reflection_type='aggregate', 
          dimensions=['Datetime0'],
          measures=['Num3', 'Int2'], 
          computations=['COUNT','COUNT'])}}
-- depends_on: {{ ref('my_anchor') }}

Example 2

{{ config(materialized='reflection', 
reflection_type='aggregate', 
dimensions=['Datetime0'], 
dimensions_by_day=['Datetime0'], 
measures=['Num3', 'Int2'], 
computations=['MIN,SUM','COUNT'], 
arrow_cache=true) }}

-- depends_on: {{ ref('my_anchor') }}

The model definition will not contain a SELECT statement, but a simple :

 -- depends_on: {{ ref('my_anchor') }}
Configuration Reflection type Value Default
reflection_type Both raw, aggregate raw
display raw Names of columns to include in a reflection, separated by commas All columns
dimensions aggregate Names of columns to use as dimension columns, separated by commas All non-decimal, float, and double columns
measures aggregate Names of columns to use as measure columns, separated by commas All decimal, float, and double columns
computations aggregate The list of computations to use. The supported computations are SUM, MIN, MAX, COUNT, and APPROX_COUNT_DISTINCT. This list maps directly to the list of measures. SUM, COUNT for each measure

(Optional) Partitioning Reflections

The partitioning configuration partitions a reflection.

Option Value Default Comment
partition_by Names of columns, separated by commas N/A
partition_method striped, consolidated striped
localsort_by Names of columns, separated by commas N/A Used for sorting within each partition

Example of partition_by

{{ config(
    materialized='table', 
    partition_by='cd_credit_rating',
    localsort_by='cd_marital_status',
    partition_method='striped'
) }}
{{ config(
    materialized='reflection', 
    reflection_type='raw', 
    partition_by='cd_credit_rating'
) }}

Example of partition_method for Reflections

{{ config(
    materialized='reflection', 
    reflection_type='raw', 
    partition_by='cd_credit_rating',
    partition_method='consolidated'
) }}

(Optional) Twin-Strategy Configuration

Dremio tables can be created only in object storage and Dremio views can be created only in Dremio spaces. Therefore, the adapter uses a twin strategy to determine how to handle creating a view with the same name as an existing table and vice versa.

Configuration Applicable Materializations Possible Values Default
twin_strategy every materialization but reflection allow, prevent, clone clone

The allow Strategy

Allows materializations of different types to have the same names. For example, if the materialized view "mySpace"."myFolder"."customers" exists in a space, it is possible to materialize the table "mySource"."myFolder"."customers" in object storage. And vice-versa.

The prevent Strategy

Prevents materializations of different types from having the same names. For example, if the materialized view "mySpace"."myFolder"."customers" exists in a space, then materializaing the table "mySource"."myFolder"."customers" in object storage drops "mySpace"."myFolder"."customers". And vice-versa.

The clone Strategy

  • If you materialize a table that has the same name as an existing view, the definition of the view is changed to a select * on the table. For example, if the materialized view "mySpace"."myFolder"."customers" exists in a space, and you materialize the table "mySource"."myFolder"."customers" in object storage, the definition of the view changes to select * from {{ "mySource"."myFolder"."customers" }}.
  • If you materialize a view that has the same name as an existing table, the view is defined as a select * on the table. For example, if the materialized table "mySource"."myFolder"."customers" exists in object storage, and you materialize the view "mySpace"."myFolder"."customers", the view is defined as select * from {{ "mySource"."myFolder"."customers" }}.

Example of Setting the Twin Strategy in a Model's config Block

{{ config(materialized='table', twin_strategy='prevent') }}

Example of Setting the Twin Strategy in a Project's YAML File

models:
 test_proj:
  # Config indicated by + and applies to all files under models/example/
  example:
   +materialized: view
   +twin_strategy: prevent

Querying External Sources

Dremio can send native SQL queries to external sources set up in Dremio as database sources. Such queries are known as "External Queries".

External queries can be set in models with the following materializations:

  • view
  • table
  • incremental

Prerequisite Steps

  1. Ensure that the external_query configuration option is set to true in the config block in the model. The SQL being ran must not use ref() or source(), instead only [schema.]table paths.
  2. Set which Database Source the external query uses, at least one -- depends_on : {{ source("<source_name>", "<table_name>") }} must be added to the model's SQL.
  3. Create a schema.yml file that specifies the source and source tables.

Example Model

{{ config(
    external_query="true",
    sql_header="/*External Query*/"
) }}

SELECT *
FROM public.countries

-- depends_on : {{ source( "external_source", "external_table") }}

Example schema.yml file

version: 2

sources:
    - name: external_source
      database: "postgres"
      tables:
          - name: external_table

If you were to use these two examples, running the model would cause dbt to run this query in Dremio:

create or replace view "@dremio"."select_external_query" as (
    select *
      from table("postgres".external_query('

SELECT *
FROM public.countries

-- depends_on : "postgres"."external_source"."external_table"'))
  )

Using Multiple Environments

A same dremio installation could handle several data environments (dev, prod, etc.). Here is how the profile will look with two different object storage paths:

multiple_targets:
  outputs:
    dev:
      dremio_space: '@user'
      dremio_space_folder: no_schema
      object_storage_path: firstpath
      object_storage_source: myobjectstorage
      password: 
      port: 9047
      software_host: my-test-host
      threads: 1
      type: dremio
      use_ssl: false
      user: 
    prod:
      dremio_space: '@user'
      dremio_space_folder: no_schema
      object_storage_path: secondpath
      object_storage_source: myobjectstorage
      password: 
      port: 9047
      software_host: my-test-host
      threads: 1
      type: dremio
      use_ssl: false
      user: 
  target: dev 

Pass in --target prod in the dbt run command to choose the prod target, since dev is the default in the profile above:

 `dbt run --target prod`