PSC-STM-B6: Add tracking of which records have been transformed #254

Open
tiredpixel opened this issue Feb 28, 2024 · 0 comments

Processing the same records more than once is not ideal, since each repeat may replace the same statements again.

When consuming from S3, we transform each file only once, and when consuming from a Kinesis stream, we keep track of our stream pointer, so this doesn’t happen much in practice. However, when switching from bulk files over to the Kinesis stream, there is a danger of roughly 48 hours of records being processed more than once.

To fix this, it would make sense to keep track of the records transformed in the previous 48 hours, so these can be safely skipped.

  • When a record has been transformed, store the etag of the processed PSC record for a period longer than the maximum stream duration (e.g. store for 48 hours).
  • When transforming a PSC record, first check whether it has been transformed in the last 48 hours, and skip it if so.
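
A minimal sketch of the two steps above, in Python for illustration only (the project's actual language and storage are not stated here). All names (`TransformedRecordTracker`, `transform_once`, the top-level `etag` key on the record) are hypothetical, and the in-memory dict stands in for whatever persistent store with expiry would really be used:

```python
import time
from typing import Callable, Dict

# Assumption: 48 hours, the retention window suggested in this issue.
TTL_SECONDS = 48 * 60 * 60


class TransformedRecordTracker:
    """Remembers the etags of transformed records for a limited time."""

    def __init__(self, ttl_seconds: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._expiry_by_etag: Dict[str, float] = {}

    def seen_recently(self, etag: str) -> bool:
        """Return True if this etag was marked within the TTL window."""
        expiry = self._expiry_by_etag.get(etag)
        if expiry is None:
            return False
        if expiry <= self._clock():
            # Entry is older than the window; forget it.
            del self._expiry_by_etag[etag]
            return False
        return True

    def mark_transformed(self, etag: str) -> None:
        """Record that the record with this etag has been transformed."""
        self._expiry_by_etag[etag] = self._clock() + self._ttl


def transform_once(record: dict,
                   tracker: TransformedRecordTracker,
                   transform: Callable[[dict], None]) -> bool:
    """Transform the record unless its etag was seen recently.

    Returns True if the record was transformed, False if it was skipped.
    Assumes the etag is available as record["etag"] (hypothetical shape).
    """
    etag = record["etag"]
    if tracker.seen_recently(etag):
        return False
    transform(record)
    # Mark only after a successful transform, so failures are retried.
    tracker.mark_transformed(etag)
    return True
```

Marking the etag only after the transform succeeds means a failed record is retried rather than silently skipped. The cost is that a crash between the two steps can still cause one repeat.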

This will ensure that the same records don’t get processed multiple times in cases of duplicates or during the changeover.

Estimate: 6 hours
