Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Update last row key in write function of user stream to avoid data loss and data duplication #1459

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

danieljbruce
Copy link
Contributor

@danieljbruce danieljbruce commented Jul 26, 2024

Summary:

This PR offers a long term fix ensuring that when retryable errors occur that the user will not experience data loss nor will duplicate data be delivered to them. It replaces the patch that is currently in place that is preventing data duplication.

Background

In this PR a fix was applied to prevent data loss in the client for readRows calls. Data duplication was also mostly addressed by adjusting watermarks except for a special case that occurred from time to time in Node v14. The patch in this PR was then applied to ensure the user didn't receive duplicate data by throwing away duplicate data just before it reached the user. Throwing away the duplicate data solves the problem, but it isn't the best permanent solution because it involves two changes to solve just one fundamental problem. This current PR replaces the patch with a simpler change that will also solve the data duplication and data loss issues.

Root cause analysis

While investigating the issue it was found that duplicated data was delivered when a row made it to the write function of the user stream, but not to the transform function of the user stream when the retry request was being formed. It was being stored here in the stream waiting to be processed by transform. Since the transform function was where the last row key was being updated and this hadn't happened yet for the row, the row was re-requested. By pushing the row key update back to the write function we ensure the row doesn't get re-requested.

Changes

  • The old test for data duplication would only fail sometimes on Node v14. This PR adds a test that always fails on Node v14 when the patch isn't in place. This will make it easier to detect data duplication and data loss if it ever happens again
  • Moved the lastRowKey and rowsRead updates to the write function instead of the transform function. When observing the call stack we can see that the emit function of the row stream directly calls the write function in the user stream demonstrating that the write function is the earliest point in time the row is guaranteed to make it to the user and therefore is where the update should occur.
  • Removed the patch in the user stream transform function as it is no longer required there.

Alternatives considered

Instead of overriding write we could pass in a write function to the constructor, but we don't want to do that because it will override _write and stop calling _transform if the Transform is ready for reading. Other options in the Transform constructor don't allow us to achieve our goal of updating the last row key earlier. Therefore, overriding write is our best option.

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigtable Issues related to the googleapis/nodejs-bigtable API. labels Jul 26, 2024
@danieljbruce danieljbruce added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 29, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 29, 2024
@danieljbruce danieljbruce added the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 29, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 29, 2024
@danieljbruce danieljbruce changed the title fix: TODO fix: Update last row key in write function of user stream to avoid data loss and data duplication Jul 29, 2024
@danieljbruce danieljbruce marked this pull request as ready for review July 29, 2024 17:11
@danieljbruce danieljbruce requested review from a team as code owners July 29, 2024 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigtable Issues related to the googleapis/nodejs-bigtable API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants