Copy into functionality through dbt cloud #319

Open
Mounika982 opened this issue Sep 20, 2024 · 0 comments

@Mounika982
I'm reaching out for @joellabes's assistance with an issue we're experiencing with dbt and Snowflake.
We are attempting to use the Snowflake COPY command in dbt to load data from named stage files and have encountered the following scenarios:

  • When using the COPY command directly in Snowflake, it processes only files that have not already been loaded. For example, if the 3 files in the internal stage have all been processed and 2 new files are then added, a subsequent COPY processes only the 2 new files, which matches the behaviour shown in the Snowflake UI.

  • When we execute the COPY command through a dbt macro, the first run processes the initial 3 files, but after the 2 new files are added, the next run processes all 5 files again. This is not the expected behaviour.

  • While digging into the problem we also tried different materializations, and we encounter data duplication with each of them.

  • With the "incremental" materialization, the data is loaded into the table successfully, but dbt then performs an INSERT from a temporary view, duplicating the data. Similarly, with the "table" materialization, the table is recreated, which wipes out the COPY load metadata, and the data is reloaded, again resulting in duplication.

  • We've identified the root cause and believe that modifying the setup to skip the unnecessary INSERT in "incremental" mode, or using CREATE TABLE IF NOT EXISTS in "table" mode, could resolve the issue (a sketch of the first idea follows this list).
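To make the first idea concrete, here is a minimal sketch (our assumption, not tested in production) of an incremental model whose body selects zero rows on incremental runs, so the INSERT that dbt issues after the pre-hook COPY becomes a no-op and the COPY-loaded rows are not duplicated. It assumes the target table already exists, since the body references {{ this }}:

{{
    config(
        materialized='incremental',
        pre_hook=[
            "{{ copy_files('dm_db.edw_stage.active_teller',
            'dm_db.dbt_anbtx',
            'active_teller_accounttypes_full',
            '.*ACTIVE_TELLER_ACCOUNTTYPES_.*\.csv') }}"
        ]
    )
}}

select * from {{ this }}
{% if is_incremental() %}
-- the pre-hook COPY already loaded the new files; give dbt nothing to insert
where 1 = 0
{% endif %}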

Could you please provide guidance on how to adapt the queries dbt sends to Snowflake so the data isn't duplicated? We'd appreciate any assistance or recommendations you can offer. Below is the code we use to load data from the named stage into the target table.

Macro:
{% macro copy_files(stage_name, target_schema, table_name, pattern) %}
-- Load only files matching the pattern that have not been loaded before:
-- FORCE = FALSE makes Snowflake skip files recorded in the table's load metadata.
COPY INTO {{ target_schema }}.{{ table_name }}
FROM @{{ stage_name }}
PATTERN = '{{ pattern }}'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format', ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
FORCE = FALSE;
{% endmacro %}
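To check whether the duplication comes from the materialization rather than from COPY itself, it may help to run the COPY outside of any model. The wrapper below is hypothetical (not part of our project): it executes the rendered statement with dbt's built-in run_query so the macro can be invoked with dbt run-operation:

{% macro run_copy_files(stage_name, target_schema, table_name, pattern) %}
    {% do run_query(copy_files(stage_name, target_schema, table_name, pattern)) %}
{% endmacro %}

For example:

dbt run-operation run_copy_files --args '{stage_name: dm_db.edw_stage.active_teller, target_schema: dm_db.dbt_anbtx, table_name: active_teller_accounttypes_full, pattern: ".*ACTIVE_TELLER_ACCOUNTTYPES_.*\.csv"}'

Run twice without adding files, the second invocation should load nothing if the table's load metadata is intact.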

Model SQL:

{{
    config(
        materialized='incremental',  -- we tried both 'incremental' and 'table'
        target_table='active_teller_accounttypes_full',
        pre_hook=[
            "{{ copy_files('dm_db.edw_stage.active_teller',
            'dm_db.dbt_anbtx',
            'active_teller_accounttypes_full',
            '.*ACTIVE_TELLER_ACCOUNTTYPES_.*\.csv') }}"
        ]
    )
}}

select * from {{ this }}
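For the second idea ("table" mode without create-or-replace), one possible direction is to keep the table out of dbt's create-or-replace path altogether: create it once with CREATE TABLE IF NOT EXISTS so Snowflake's COPY load metadata survives between runs, and let the copy_files macro above do all the loading. A sketch, with a hypothetical column list since the real DDL isn't shown in this issue:

{% macro ensure_accounttypes_table() %}
    create table if not exists dm_db.dbt_anbtx.active_teller_accounttypes_full (
        -- hypothetical columns; replace with the actual schema of the staged files
        account_type varchar,
        account_description varchar,
        loaded_at timestamp_ntz
    );
{% endmacro %}

This could run as an on-run-start hook in dbt_project.yml, ahead of the copy_files pre-hook.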
