[Feature] Historical Backfilling and Rebasing of Snapshots #9892
Comments
@dbeatty10 Let me know if this is sufficient for lining up the feature. Happy to provide my scripts as a point of reference.
Thanks for opening this @walker-philips 🤩 Yes, any additional points of reference (such as your scripts) would be great to see as well.
@dbeatty10 Where would be the best place to add those? I didn't want to fork just for a few files, and I also cannot add my full project due to compliance reasons.
Here are a few ideas:
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Voicing my support for this one; I recently had a similar use case for snapshots needing to be 'backfilled'.
@walker-philips we're waiting on additional points of reference (such as your scripts) to assess this feature request further. Are you still planning on sharing those? We've got an initiative right now related to snapshots, so it would be good timing for us to consider your ideas if you are able to provide more detail. You can read more here: Without additional information about what this feature request would look like, this issue is likely to go stale again.
So here are my assumptions and approach:
My assumption:
The approach works as follows: for each snapshot to be backfilled, we have to create a backfill model that transforms all the historical data into a form that can be unioned with the current snapshot table. It has some necessary criteria that always have to be met, namely the designation of a unique_key and a dbt_valid_from stand-in. I utilize pre/post hooks to execute INSERT statements that save the backfill data as its own table, disconnected from the dbt framework. This helps avoid circular dependencies. The dbt_backfill_applier_model() is simply boilerplate code that helps determine whether to reuse an instantiated backfill dataset or create it for the first time. Lastly, render_backfills deletes itself, as it's essentially useless from a database asset standpoint; it has to be a table in order to force the triggering of the post hook and the execution of the run_query command.
Admittedly, this feels super hacky. But the key takeaways I have are:
apply_snapshot_backfill.txt
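To make the shape of this concrete, here is a minimal sketch of the kind of backfill model described above, assuming a check-strategy snapshot over bank-account balances. All model, schema, column, and saved-table names are illustrative assumptions, not taken from the attached scripts:

```sql
-- Hypothetical backfill model (e.g. models/backfills/account_balances_backfill.sql).
-- It reshapes pre-existing history into snapshot-compatible columns, and a
-- post_hook persists a copy as a standalone table outside the dbt DAG, which
-- avoids circular dependencies and keeps the backfill recoverable.

{{
    config(
        materialized='table',
        post_hook="create table if not exists backfills.account_balances_backfill_save as select * from {{ this }}"
    )
}}

select
    account_id,                                  -- the unique_key stand-in
    balance,
    recorded_at as dbt_valid_from,               -- the dbt_valid_from stand-in
    lead(recorded_at) over (
        partition by account_id
        order by recorded_at
    ) as dbt_valid_to                            -- null for the latest historical row
from {{ source('legacy', 'account_balance_history') }}
```

The `lead()` window gives each historical row a dbt_valid_to that ends exactly where the next row begins, which keeps the backfilled intervals linear before they are unioned with the real snapshot.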
Realized I was missing something like this to handle consecutive, repeating data. There's a potential that by just grouping by the check_cols (for a check strategy at least), you would eliminate future, repeated values entirely. You can then filter away these consecutive rows to prune snapshots and historical data by just adding
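The snippet referenced above is not reproduced here. As a rough sketch of what such consecutive-row filtering could look like for a check strategy, assuming a single check_col named balance and a warehouse that supports IS DISTINCT FROM (names are illustrative):

```sql
-- Hypothetical pruning query: keep only rows whose check_cols value differs
-- from the immediately preceding row for the same unique key.
with ordered as (

    select
        account_id,
        balance,                                   -- the check_col in this example
        dbt_valid_from,
        dbt_valid_to,
        lag(balance) over (
            partition by account_id
            order by dbt_valid_from
        ) as prev_balance
    from {{ ref('account_balances_snapshot') }}

)

select account_id, balance, dbt_valid_from, dbt_valid_to
from ordered
where prev_balance is null                         -- first row per key
   or balance is distinct from prev_balance        -- value actually changed
```

After pruning, dbt_valid_to would still need to be recomputed (e.g. with lead(dbt_valid_from)) so the remaining intervals stay contiguous.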
Is this your first time submitting a feature request?
Describe the feature
My understanding: Snapshots are intended to record historical data safely and avoid accidental deletion of data. They are driven by timestamps or data-driven deltas specified by the user. Data-driven Snapshots cannot "replay" history, such that a Snapshot query should only return a single row to represent a desired unit of data. As an example, a Snapshot check-strategy query of bank account balances should only show the current day's balance per account, as opposed to several days' worth of balances per account.
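For context, a check-strategy snapshot along the lines of the bank-balance example might look roughly like this (names are illustrative; exact config keys vary by dbt version):

```sql
{% snapshot account_balances_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='account_id',
        strategy='check',
        check_cols=['balance']
    )
}}

select account_id, balance
from {{ source('banking', 'account_balances') }}

{% endsnapshot %}
```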
Users of dbt may possess historical data of lower or equal quality prior to a Snapshot being formally initialized. Currently, there is no way to backfill a Snapshot. Also, a user may change their approach to snapshotting data, causing mass, unnecessary snapshotting of rows, or business rules may change, leading to a reduction in the depth of snapshots. There is no way to "condense" snapshots by removing rows of now-duplicate data and updating the dbt_valid_from/dbt_valid_to columns to remain linear.
Describe alternatives you've considered
I have a rough working approach utilizing a macro, a template model, a controlling model that coordinates the backfill of several Snapshots, and pre/post hooks to run the necessary update statements. It does not attempt to alter the existing Snapshot. Instead, you could feed the existing Snapshot as a backfill to a new Snapshot model. To promote data survivability, a user can choose to save backfill data as a table to allow recovery.
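As a hedged sketch of the "feed the existing Snapshot as a backfill to a new Snapshot model" idea, a downstream model could union the preserved backfill table with the new snapshot, cutting the old history off where the new snapshot begins. All relation and column names below are hypothetical:

```sql
-- Hypothetical downstream model that stitches the preserved backfill history
-- together with the new snapshot.

with backfill as (

    select account_id, balance, dbt_valid_from, dbt_valid_to
    from {{ source('backfills', 'account_balances_backfill_save') }}

),

current_snapshot as (

    select account_id, balance, dbt_valid_from, dbt_valid_to
    from {{ ref('account_balances_snapshot') }}

)

-- keep only backfill intervals that closed before the new snapshot began,
-- so the two histories do not overlap
select * from backfill
where dbt_valid_to is not null
  and dbt_valid_to <= (select min(dbt_valid_from) from current_snapshot)

union all

select * from current_snapshot
```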
Who will this benefit?
Any users with historical data they would like to merge into a Snapshot, or anyone who wants to rebase/condense an existing Snapshot due to a change in Snapshot strategy/methodology.
Are you interested in contributing this feature?
Yes, I have an example of my imperfect approach.
Anything else?
No response