-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
archival: Use read-write fence in the ntp_archiver #24574
base: dev
Are you sure you want to change the base?
archival: Use read-write fence in the ntp_archiver #24574
Conversation
2e39592
to
a392416
Compare
Retry command for Build#59774please wait until all jobs are finished before running the slash command
|
CI test resultstest results on build#59774
test results on build#59783
test results on build#59824
test results on build#59868
test results on build#59928
test results on build#59958
test results on build#59971
test results on build#59988
test results on build#60029
test results on build#60039
test results on build#60048
test results on build#60128
test results on build#60147
|
a392416
to
ce0cb32
Compare
Retry command for Build#59783please wait until all jobs are finished before running the slash command
|
ce0cb32
to
06ca1b0
Compare
Retry command for Build#59824please wait until all jobs are finished before running the slash command
|
06ca1b0
to
63eb5f5
Compare
Retry command for Build#59868please wait until all jobs are finished before running the slash command
|
63eb5f5
to
b7836dd
Compare
b7836dd
to
a52b914
Compare
Retry command for Build#59928please wait until all jobs are finished before running the slash command
|
a52b914
to
9cb55f3
Compare
Retry command for Build#59958please wait until all jobs are finished before running the slash command
|
/ci-repeat 1 |
Retry command for Build#59988please wait until all jobs are finished before running the slash command
|
83a0abc
to
8b33548
Compare
Retry command for Build#60029please wait until all jobs are finished before running the slash command
|
/ci-repeat 1 |
Retry command for Build#60039please wait until all jobs are finished before running the slash command
|
ba45e29
to
9b0a748
Compare
Retry command for Build#60048please wait until all jobs are finished before running the slash command
|
/ci-repeat 1 |
The snapshot doesn't serialize the applied-offset field of the manifest. This commit fixes this. Signed-off-by: Evgeny Lazin <[email protected]>
The fence is a value which is produced by recording the current offset of the last applied archival STM command. The fence is added to the archival STM config batch. If no commands were applied to the STM the offset of the last applied command will be the same and the configuration batch will be applied. Otherwise, if some command was applied after the record batch was constructed the batch will not be applied. This is essentially a concurrency control mechamism. The commands are already present in the STM but not used as for now. The existing metadata validation mechanism stays.
The check is no longer needed because we're using fencing mechanism to prevent concurrency. The check is actually performed when the segment is uploaded. The removed check was used as an optimistic concurrency control mechanism and is now replaced by the RW-fence. Also, deprecate the error code that STM uses to communicate the consistency violation to the caller. The error code is not used anywhere else. Signed-off-by: Evgeny Lazin <[email protected]>
The option was never turned on by default. Our plan was to eventually enable it but it is now replaced with fencing mechanism which is always on and can't be disabled. Signed-off-by: Evgeny Lazin <[email protected]>
This commit enables rw-fence mechanism for housekeeping reuploads. In case if the housekeeping is running on a stale version of the STM snapshot the update will be rejected. Signed-off-by: Evgeny Lazin <[email protected]>
Use rw-fence for both normal retention/GC and spillover. Currently, the upload metadata validation only checks added segments. Because of that there is still some room for concurrent updates during the housekeeping (GC or retention). With rw-fence the metadata updates related to retention or GC will become safer. Signed-off-by: Evgeny Lazin <[email protected]>
Signed-off-by: Evgeny Lazin <[email protected]>
Previous implementation of the archival STM had a bug because of which the last applied offset wasn't added to the snapshot. This could potentially make replicas diverge. The solution is to add a feature flag and use the rw-fence mechanism only if all replicas are upgraded. Signed-off-by: Evgeny Lazin <[email protected]>
Signed-off-by: Evgeny Lazin <[email protected]>
9b0a748
to
779bc98
Compare
Split method into two parts: - wait_uploads_complete - replicate_archival_metadata The first method is waiting until all uploads are finished. But unlike wait_uploads it doesn't replicate the metadata. Instead of doing this it just returns all information needed to replicate it. The second method takes the output of the wait_uploads_complete and replicates it. Finally, the wait_uploads method is now implemented using this two new methods. This is a preparation for the next commit. Currently, when we're uploading both compacted and non-compacted segments in the same iteration of the upload loop they both replicated independently. This worked fine before (with the exception of the race condition on _projected_manifest_clean_at). But with the fencing mechanism one of this wait_uploads calls will fail to replicate because of the fence. Currently, the same fence is used to invoke wait_uploads for compacted and non-compacted segments. The solution to this problem is to use wait_uploads_complete to wait until both compacted and non-compacted uploads are completed and then use a single replicate_archival_metadata call. This call will be able to use the fence value correctly. Signed-off-by: Evgeny Lazin <[email protected]>
c52583f
to
aa6837a
Compare
Put non-compacted segment metadata and compacted segment metadata into the same archival control batch. This is needed to make fencing work correctly if both comapcted and non-compacted uploads are made in the same upload loop iteraton. Signed-off-by: Evgeny Lazin <[email protected]>
auto error = co_await _parent.archival_meta_stm()->add_segments( | ||
auto batch_builder = _parent.archival_meta_stm()->batch_start( | ||
deadline, _as); | ||
if (!fence.unsafe_add) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth adding likely()
here for prod?
@@ -468,7 +468,7 @@ struct configuration final : public config_store { | |||
property<std::chrono::milliseconds> | |||
cloud_storage_topic_purge_grace_period_ms; | |||
property<bool> cloud_storage_disable_upload_consistency_checks; | |||
property<bool> cloud_storage_disable_metadata_consistency_checks; | |||
deprecated_property cloud_storage_disable_metadata_consistency_checks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
squash this with previous commit?
_parent.archival_meta_stm()->get_insync_offset()); | ||
builder.read_write_fence(fence.read_write_fence); | ||
} | ||
builder.add_segments( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some differences between archival_metadata_stm::add_segments()
and command_batch_builder::add_segments()
, notably that the clean_offset
and highest_producer_id
are added as raw key-values in the former but not the latter. Additionally, there is some logging in archival_metadata_stm::add_segments()
that we will no longer hit.
Are any of these considerations worth revisiting? Can we de-duplicate some code across these functions for sake of clean-up/legibility?
@@ -2661,6 +2668,13 @@ ss::future<> ntp_archiver::apply_archive_retention() { | |||
auto deadline = ss::lowres_clock::now() + sync_timeout; | |||
|
|||
auto batch = _parent.archival_meta_stm()->batch_start(deadline, _as); | |||
if (!fence.unsafe_add) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another likely()
call?
@@ -1041,6 +1041,11 @@ ss::future<std::error_code> archival_metadata_stm::do_add_segments( | |||
} | |||
|
|||
ss::future<> archival_metadata_stm::do_apply(const model::record_batch& b) { | |||
/*TODO: remove*/ vlog( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what sort of timeframe is "temporary" in terms of code existing in dev
?
.read_write_fence = fence, | ||
// Only use the rw-fence if the feature is enabled which requires | ||
// major version upgrade. | ||
.unsafe_add = !_feature_table.local().is_active( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in hindsight, is unsafe_add
a good variable name, or would use_rw_fence
or something along those lines be better so we don't need !
everywhere?
/// Replicate archival metadata | ||
ss::future<ntp_archiver::upload_group_result> replicate_archival_metadata( | ||
archival_stm_fence fence, | ||
std::list<wait_uploads_complete_result> finished_uploads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the motivation behind this being an std::list
?
The fence is a value which is produced by recording the current offset of the last applied archival STM command. The fence is added to the archival STM config batch. If no commands were applied to the STM the offset of the last applied command will be the same and the configuration batch will be applied. Otherwise, if some command was applied after the record batch was constructed the batch will not be applied.
This is essentially a concurrency control mechanism. The commands are already present in the STM but not used as for now. The existing metadata validation mechanism in the NTP archiver is still working but we no longer need to check the commands in the STM.
Backports Required
Release Notes