Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stream): wait committed epoch in state table init_epoch #19223

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

wenym1
Copy link
Contributor

@wenym1 wenym1 commented Nov 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

The init_epoch part of #18312.

It is required that after init_epoch returns, we can always read the consistent data on prev_epoch, and otherwise, we will read inconsistent data and pollute the downstream. init_epoch is called during recovery and configuration. Previously, in recovery, this is ensured by CN waiting the latest hummock version, and in configuration change, this is ensured by pausing the barrier injection and do a CN-wide try_wait_epoch call in the WaitEpochCommit streaming service rpc to wait for the committed epoch bumping up globally..

In this PR, in the init_epoch of StateTable, we will change to wait for the committed_epoch of the state table to bump up to the prev_epoch. With this PR, during recovery there is no need for CN to wait for the latest hummock version, and for configuration change without update vnode bitmap (replace table, sink into table, not including scale), there is no need to do the pause-wait-resume barrier injection.

Deadlock can easily happen. Bumping up committed_epoch depends on collecting barriers from all actors, but if init_epoch blocks the handling of initial barrier, when init_epoch waits for bumping up committed_epoch, it blocks barrier handling, and causes committed_epoch never bump up, and cause deadlock.

To avoid deadlock, we should strictly follow the order when handling the initial barrier. The order should be, receive first barrier, yield first barrier, and then StateTable::init_epoch. In this PR, we just ensure that all usages of StateTable::init_epoch will be included in the changed code so that we can carefully review the code. The StateTable::init_epoch method will become async, and therefore all direct usages on it will be included in the changed code. Some non-async utils method that calls StateTable::init_epoch is required to be async as well, and therefore will be included in the changed code as well. There are two utils methods, both named init_epoch, that were already async, and call StateTable::init_epoch. To include the usages of these two methods in the changed code, the two methods are both renamed to init_epoch_after_yield_barrier, so that the usage can also be included in the changed code. Moreover, if the handling order is incorrect, deadlock must happen if the problematic code is reached. Hence, if we can pass all the CI tests, we can be confident that the correctness is fulfilled.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@graphite-app graphite-app bot requested a review from a team November 4, 2024 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants