[Bug] Review the incremental logic in iterable__events #45

Closed
2 of 4 tasks
fivetran-reneeli opened this issue Jul 25, 2024 · 1 comment · Fixed by #51
Labels
error:unforced status:accepted Scoped and accepted into queue

Comments

@fivetran-reneeli
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

While reviewing PR #44, we discovered that the row count for iterable__events does not match the row count of stg_iterable__event following an incremental run with test data on BigQuery. We should investigate why this is happening, whether it is expected behavior, or whether we should update the logic.

Relevant error log or model output

No response

Expected behavior

I believe the row count for iterable__events should match the row count of stg_iterable__event.

dbt Project configurations

N/A, but reproduce with an incremental run.

Package versions

'0.12.0'

What database are you using dbt with?

bigquery

dbt Version

1.7.9

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will probably need assistance.
  • No.
@fivetran-joemarkiewicz
Contributor

After exploring this further I believe I was able to get to the bottom of the issue with the iterable__events incremental model.

The crux of the issue resides in how we have structured the incremental logic and the incremental strategies we are using. We're leveraging the insert_overwrite and delete+insert strategies across various warehouses, and both of these strategies rely on the partition_by logic within the incremental config block:

```sql
partition_by={"field": "created_on", "data_type": "date"} if target.type not in ('spark','databricks') else ['created_on'],
```

This is appropriate, as it improves the performance of incremental loads by operating only on the partitions in question. However, it conflicts with the field we filter on to perform the incremental load: created_at.

```sql
{% if is_incremental() %}
where created_at >= (select max(created_at) from {{ this }} )
{% endif %}
```

The created_at field is a timestamp, but the created_on field is a date. Therefore, we partition by date (good), but we filter for only records that are greater than or equal to the max timestamp. Suppose the max created_at in the target is 2024-07-31T20:00:00, and the source table contains five other events on that same day (at 2024-07-31T08:00:00, 10:00:00, 12:00:00, 14:00:00, and 19:00:00). The incremental filter excludes those five records from the batch, but the insert_overwrite strategy still replaces the entire 2024-07-31 partition with the batch, so those five same-day records are removed from the target. This means subsequent incremental runs of the iterable__events model risk erroneously removing events that occurred on the same day as the max timestamp.
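To see why the >= timestamp filter loses rows under an overwrite-by-partition strategy, here is a minimal Python simulation of the mechanics. The row values and the simplified overwrite model are illustrative only, not the package's actual code.

```python
from datetime import datetime

# Hypothetical target table rows: (event_id, created_at).
target = [
    (1, datetime(2024, 7, 31, 8, 0)),
    (2, datetime(2024, 7, 31, 10, 0)),
    (3, datetime(2024, 7, 31, 20, 0)),  # max created_at in the target
]
# The source has one new event later on the same day.
source = target + [(4, datetime(2024, 7, 31, 22, 0))]

# Incremental filter: created_at >= max(created_at) from the target.
max_ts = max(ts for _, ts in target)
batch = [row for row in source if row[1] >= max_ts]

# insert_overwrite semantics (simplified): every date partition touched
# by the batch is replaced with only the batch's rows for that partition.
touched = {ts.date() for _, ts in batch}
result = [row for row in target if row[1].date() not in touched] + batch

print(sorted(rid for rid, _ in result))  # [3, 4] -- ids 1 and 2 were dropped
```

The batch contains only the rows at or after 20:00, yet the whole 2024-07-31 partition is rewritten from that batch, silently dropping the earlier same-day events.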

To address this, we will need to take one of the following approaches to update the existing incremental logic (note that neither approach involves adjusting the partition logic, as we do not want to partition by a timestamp):

  • Update the incremental logic to use a strict greater than (not greater than or equal to).
    •  where created_at > (select max(created_at) from {{ this }} )
    • I tested this and found the previously missing records in our test data were still present in the end table after an incremental run. However, I don't have full confidence in this approach.
  • Update the incremental logic to compare dates rather than timestamps. This way we do not ignore records on the max day that fall within the partition window.
    •  where cast(created_at as date) >= (select max(cast(created_at as date)) from {{ this }} )
    • I tested this with some sample data and saw that subsequent incremental runs successfully captured the additional events that occurred on the same day.
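As a sanity check on the date-based filter, here is a minimal Python simulation of the same overwrite-by-partition mechanics; row values and the simplified model are illustrative, not the package's actual code.

```python
from datetime import datetime

# Hypothetical target table rows: (event_id, created_at).
target = [
    (1, datetime(2024, 7, 31, 8, 0)),
    (2, datetime(2024, 7, 31, 10, 0)),
    (3, datetime(2024, 7, 31, 20, 0)),
]
source = target + [(4, datetime(2024, 7, 31, 22, 0))]

# Date-based filter: cast(created_at as date) >= max date in the target.
max_day = max(ts for _, ts in target).date()
batch = [row for row in source if row[1].date() >= max_day]

# insert_overwrite semantics (simplified): touched date partitions are
# replaced with the batch's rows, which now include every same-day event.
touched = {ts.date() for _, ts in batch}
result = [row for row in target if row[1].date() not in touched] + batch

print(sorted(rid for rid, _ in result))  # [1, 2, 3, 4] -- nothing lost
```

Because the batch carries all rows from the max day, rewriting the partition no longer discards earlier same-day events.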

Based on my scoping, the latter approach seems the ideal one. We should also double-check the other incremental models in this package to confirm this is not an issue elsewhere.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added status:accepted Scoped and accepted into queue and removed status:scoping Currently being scoped labels Jul 31, 2024
@fivetran-reneeli fivetran-reneeli linked a pull request Oct 29, 2024 that will close this issue