External materialized model with per_thread_output parameter does not work #492

Open
elephantmetropolis opened this issue Dec 17, 2024 · 1 comment

Comments


elephantmetropolis commented Dec 17, 2024

Hello, I am running into an issue when using the external materialization with the per_thread_output option.

This option is supposed to create a number of files based on the number of threads used by DuckDB.

When this parameter is passed along with the external materialization, I receive an error:

HTTP Error: Unable to connect to URL "https://my.s3/my_bucket/my_path": 404 (Not Found)

The funny thing is that the files are actually written to S3 in the right place, and it is the current model that crashes, not the downstream model. For example, if model_b depends on model_a, the run goes as follows:

11:04:12  1 of 2 START sql external model main.table_a ............... [RUN]
11:04:22  1 of 2 ERROR creating sql external model main.model_a ...... [ERROR in 9.70s]
11:04:22  2 of 2 SKIP relation main.model_b ................... [SKIP]

After some research I found this part of the code:

if rendered_options.get("partition_by"):

From what I understand, only models created with the partition_by parameter can use a file glob, and I suppose that is the problem here.

Is this assumption correct? If so, is it possible to add support for the per_thread_output parameter, or would it be better to handle this in a more "parameter agnostic" way?

I'm not an expert, but I'd be glad to contribute in any way I can.
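
For illustration, here is a minimal sketch of what broadening that check could look like, assuming the glob suffix is only added under the partition_by condition; the helper name and surrounding structure are my own and not the actual impl.py code:

def external_read_location(location: str, fmt: str, rendered_options: dict) -> str:
    # Hypothetical helper, not the real impl.py: both partition_by and
    # per_thread_output make COPY write a directory of files, so the view
    # over the external location would need a glob pattern in both cases.
    if rendered_options.get("partition_by") or rendered_options.get("per_thread_output"):
        return f"{location}/*.{fmt}"
    return location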

Edit: I tried to modify impl.py and it seems this works on my side: #493

Edit 2: After more investigation, it seems the per_thread_output parameter is not working properly for external materializations. I tested writing to S3 using a post-hook and a macro, and this code works:

{% macro export_to_s3(table) %}
    {% set s3_path = 's3://my_bucket' %}
    COPY {{ table }}
    TO '{{ s3_path }}/{{ table.name }}'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);
{% endmacro %}
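
For context, the macro above is invoked from a post-hook, roughly like this (the model body and the table materialization below are placeholders, not the exact model):

{{
    config(
        materialized='table',
        post_hook="{{ export_to_s3(this) }}"
    )
}}

select * from {{ ref('some_upstream_model') }}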

But using per_thread_output directly in {{ config() }} will end up with DuckDB writing one big file to S3.

For info, I have DuckDB set to use 30 threads and 300 GB of RAM.

However, the following code doesn't work properly: only one file is created, although it is created inside a subfolder, which is consistent with the expected PER_THREAD_OUTPUT behavior:

{{
    config(
        materialized='external',
        format='parquet',
        options={
            "per_thread_output": 1
        }
    )
}}
...

It seems there are two issues here: first, per_thread_output is not taken into account for the file glob, and second, the per_thread_output parameter behaves weirdly (only using one thread?).

jwills (Collaborator) commented Dec 18, 2024

@elephantmetropolis thanks for the report here, this is a fun one-- taking a look
