
Fix stepping down on timeout #24590

Merged

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Dec 17, 2024

When a follower is busy, it may fail to process full heartbeat
requests sent by the leader quickly enough. In this case the follower's
RPC handler sets the follower_busy result in the heartbeat_reply. The
leader should still treat the follower replica as online in this case:
the node hosting the replica must be online to reply with the
follower_busy error.

This way we prevent overly eager leader step-downs when follower
replicas are slow.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • stable leadership under load

@vbotbuildovich
Collaborator

Retry command for Build#59862

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 17, 2024

CI test results

test results on build#59862
test_id test_kind job_url test_status passed
coordinator_rpunit.coordinator_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05c-4ce5-a326-067904a6399d FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_archival_rpunit.gtest_archival_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
partition_properties_stm_test_rpunit.partition_properties_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d5a4-9a48-4d45-8031-616226f6505a FLAKY 4/6
rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/1
tm_stm_tests_rpunit.tm_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
test results on build#59902
test_id test_kind job_url test_status passed
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b161-4c06-ada3-c542ebe6df9a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b162-4156-9369-57ef78449e35 FAIL 0/2
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d913-3ccd-4237-a5c3-aa10b1fd3682 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d925-6325-4c4e-a39e-b2f4af0d69c2 FLAKY 4/6
test results on build#59984
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FAIL 0/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestCompactedTopic.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from c321e29 to e203f89 on December 18, 2024 08:02
@vbotbuildovich
Collaborator

Retry command for Build#59902

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from e203f89 to 592fc64 on December 20, 2024 07:13
@vbotbuildovich
Collaborator

Retry command for Build#59984

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

src/v/raft/tests/raft_fixture.cc (review thread, outdated, resolved)
src/v/raft/tests/raft_fixture.cc (review thread, outdated, resolved)
@@ -489,6 +490,8 @@ ss::future<> raft_node_instance::stop() {
vlog(_logger.debug, "stopping protocol");
co_await _buffered_protocol->stop();
co_await _protocol->stop();
// release f_log pointer before stopping raft
Contributor


Any reason we do it before? It should not matter: it's a shared pointer, so the underlying object will only die when _raft is gone.
Same question about stopping the log before deleting raft, but that's out of scope for this PR.

src/v/raft/consensus.cc (review thread, resolved)
The `raft::reply_result::follower_busy` code indicates that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the reply from `timeout` makes it less confusing for the
reader and differentiates the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <[email protected]>
Wired the raft RPC service handler into the Raft fixture to make the
tests more accurate and cover the service code with tests.

Signed-off-by: Michał Maślanka <[email protected]>
Propagating timeouts to the node sending the RPC request is crucial for
accurately testing the Raft implementation.

Signed-off-by: Michał Maślanka <[email protected]>
Added a wrapper around `storage::log` that allows us to inject
storage-layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <[email protected]>
When a follower is busy, it may fail to process full heartbeat
requests sent by the leader quickly enough. In this case the follower's
RPC handler sets the `follower_busy` result in the heartbeat_reply. The
leader should still treat the follower replica as online in this case:
the node hosting the replica must be online to reply with the
`follower_busy` error.

This way we prevent overly eager leader step-downs when follower
replicas are slow.
Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from 592fc64 to 67e7c6e on December 23, 2024 08:03
@travisdowns
Member

Leader should still treat
a follower replica as online in this case. The replica hosting node must
be online to reply with the follower_busy error.

Right, and just to clarify: it is not exactly that the leader should treat the follower as "online" for the purposes of how it interacts with the follower (e.g., decisions about whether to send it RPC payloads), but rather that it should not consider itself isolated from that follower when making step-down decisions, right?

@@ -32,7 +32,7 @@ enum class reply_result : uint8_t {
success,
failure,
group_unavailable,
timeout
follower_busy
Member


I'm a bit confused, is the timeout case no longer possible?

Don't we have at least two cases: the follower replies immediately with "busy", and the follower never replies at all?

Member


I see, I guess the timeout case never ends up using the reply_result; it is handled on a different path: these codes are only for cases where an RPC was actually received, right?

Member Author


exactly

@mmaslankaprv mmaslankaprv merged commit 8543b66 into redpanda-data:dev Jan 2, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24590-v24.2.x-670 remotes/upstream/v24.2.x
git cherry-pick -x 6a1e34bead 95a29dba65 5f69d9b733 7d33bb5659 f04995a751 8b57b42101 67e7c6ea21

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24590-v24.3.x-322 remotes/upstream/v24.3.x
git cherry-pick -x 6a1e34bead 95a29dba65 5f69d9b733 7d33bb5659 f04995a751 8b57b42101 67e7c6ea21

Workflow run logs.
