[Bug] Review the incremental logic in iterable__events #45

Closed
2 of 4 tasks
fivetran-reneeli opened this issue Jul 25, 2024 · 1 comment · Fixed by #51
Labels
error:unforced status:accepted Scoped and accepted into queue

Comments

@fivetran-reneeli
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

While reviewing PR #44, we discovered that the row count for iterable__events does not match the row count of stg_iterable__event following an incremental run with test data on BigQuery. We should investigate why this is happening, whether it is expected behavior, or whether we should update the logic.

Relevant error log or model output

No response

Expected behavior

I believe the row count for iterable__events should match the row count of stg_iterable__event.

dbt Project configurations

N/A, but reproduce with an incremental run.

Package versions

'0.12.0'

What database are you using dbt with?

bigquery

dbt Version

1.7.9

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will probably need assistance.
  • No.
@fivetran-joemarkiewicz
Contributor

After exploring this further I believe I was able to get to the bottom of the issue with the iterable__events incremental model.

The crux of the issue resides in how we have structured the incremental logic and the incremental strategies we are using. We're leveraging the insert_overwrite and delete+insert strategies across various warehouses, and both of these strategies rely on the partition_by logic within the incremental config block:

```sql
partition_by={"field": "created_on", "data_type": "date"} if target.type not in ('spark','databricks') else ['created_on'],
```

This is appropriate, as it improves the performance of incremental loads by operating only on the partitions in question. However, it conflicts with the field we filter on to perform the incremental load: created_at.

```sql
{% if is_incremental() %}
where created_at >= (select max(created_at) from {{ this }} )
{% endif %}
```

The created_at field is a timestamp, but the created_on field is a date. Therefore, we partition by date (good), but we filter for only records that are greater than or equal to the max timestamp. Suppose the max created_at in the target is 2024-07-31T20:00:00, and the source table contains five other events on that same day (at 2024-07-31T08:00:00, 10:00:00, 12:00:00, 14:00:00, and 19:00:00). The incremental filter excludes those five records from the batch, but the insert_overwrite strategy still replaces the entire 2024-07-31 partition with the batch, so those five same-day records are removed from the target. This means subsequent incremental runs of the iterable__events model risk erroneously removing events that occurred on the same day as the max timestamp.
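To see why the >= timestamp filter loses rows under an overwrite-by-partition strategy, here is a minimal Python simulation of the mechanics. The row values and the simplified overwrite model are illustrative only, not the package's actual code.

```python
from datetime import datetime

# Hypothetical target table rows: (event_id, created_at).
target = [
    (1, datetime(2024, 7, 31, 8, 0)),
    (2, datetime(2024, 7, 31, 10, 0)),
    (3, datetime(2024, 7, 31, 20, 0)),  # max created_at in the target
]
# The source has one new event later on the same day.
source = target + [(4, datetime(2024, 7, 31, 22, 0))]

# Incremental filter: created_at >= max(created_at) from the target.
max_ts = max(ts for _, ts in target)
batch = [row for row in source if row[1] >= max_ts]

# insert_overwrite semantics (simplified): every date partition touched
# by the batch is replaced with only the batch's rows for that partition.
touched = {ts.date() for _, ts in batch}
result = [row for row in target if row[1].date() not in touched] + batch

print(sorted(rid for rid, _ in result))  # [3, 4] -- ids 1 and 2 were dropped
```

The batch contains only the rows at or after 20:00, yet the whole 2024-07-31 partition is rewritten from that batch, silently dropping the earlier same-day events.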

To address this, we will need to take one of the following approaches to update the existing incremental logic (note that neither approach involves adjusting the partition logic, as we do not want to partition by a timestamp):

  • Update the incremental logic to use a strict greater than (not greater than or equal to).
    •  where created_at > (select max(created_at) from {{ this }} )
    • I tested this and found the previously missing records in our test data were still present in the end table after an incremental run. However, I don't have full confidence in this approach.
  • Update the incremental logic to compare dates rather than timestamps. This way we do not ignore records on the max day that fall within the partition window.
    •  where cast(created_at as date) >= (select max(cast(created_at as date)) from {{ this }} )
    • I tested this with some sample data and saw that subsequent incremental runs successfully captured the additional events that occurred on the same day.
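As a sanity check on the date-based filter, here is a minimal Python simulation of the same overwrite-by-partition mechanics; row values and the simplified model are illustrative, not the package's actual code.

```python
from datetime import datetime

# Hypothetical target table rows: (event_id, created_at).
target = [
    (1, datetime(2024, 7, 31, 8, 0)),
    (2, datetime(2024, 7, 31, 10, 0)),
    (3, datetime(2024, 7, 31, 20, 0)),
]
source = target + [(4, datetime(2024, 7, 31, 22, 0))]

# Date-based filter: cast(created_at as date) >= max date in the target.
max_day = max(ts for _, ts in target).date()
batch = [row for row in source if row[1].date() >= max_day]

# insert_overwrite semantics (simplified): touched date partitions are
# replaced with the batch's rows, which now include every same-day event.
touched = {ts.date() for _, ts in batch}
result = [row for row in target if row[1].date() not in touched] + batch

print(sorted(rid for rid, _ in result))  # [1, 2, 3, 4] -- nothing lost
```

Because the batch carries all rows from the max day, rewriting the partition no longer discards earlier same-day events.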

Based on my scoping, the latter approach seems the ideal one. We should also double-check the other incremental models in this package to confirm this is not an issue elsewhere.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added status:accepted Scoped and accepted into queue and removed status:scoping Currently being scoped labels Jul 31, 2024
@fivetran-reneeli fivetran-reneeli linked a pull request Oct 29, 2024 that will close this issue