
Fix stepping down on timeout #24590

Merged

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Dec 17, 2024

When a follower is busy, it may fail to process full heartbeat
requests sent by the leader quickly enough. In this case the follower's
RPC handler sets the follower_busy result in the heartbeat_reply. The
leader should still treat the follower replica as online in this case:
the node hosting the replica must be online to reply with the
follower_busy error.

This way we prevent overly eager leader step-downs when follower
replicas are slow.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • stable leadership under load

@vbotbuildovich
Collaborator

Retry command for Build#59862

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 17, 2024

CI test results

test results on build#59862
test_id test_kind job_url test_status passed
coordinator_rpunit.coordinator_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05c-4ce5-a326-067904a6399d FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_archival_rpunit.gtest_archival_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
partition_properties_stm_test_rpunit.partition_properties_stm_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d5a4-9a48-4d45-8031-616226f6505a FLAKY 4/6
rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition ducktape https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be FAIL 0/1
tm_stm_tests_rpunit.tm_stm_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a FAIL 0/2
test results on build#59902
test_id test_kind job_url test_status passed
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b161-4c06-ada3-c542ebe6df9a FAIL 0/2
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit unit https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b162-4156-9369-57ef78449e35 FAIL 0/2
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d913-3ccd-4237-a5c3-aa10b1fd3682 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59902#0193d925-6325-4c4e-a39e-b2f4af0d69c2 FLAKY 4/6
test results on build#59984
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FAIL 0/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6
rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTestCompactedTopic.test_write.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59984#0193e333-f09b-4aaf-ad02-aa8349cc2f01 FLAKY 1/6

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from c321e29 to e203f89 on December 18, 2024 08:02
@vbotbuildovich
Collaborator

Retry command for Build#59902

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from e203f89 to 592fc64 on December 20, 2024 07:13
@vbotbuildovich
Collaborator

Retry command for Build#59984

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

src/v/raft/tests/raft_fixture.cc (review thread, outdated, resolved)
src/v/raft/tests/raft_fixture.cc (review thread, outdated, resolved)
@@ -489,6 +490,8 @@ ss::future<> raft_node_instance::stop() {
vlog(_logger.debug, "stopping protocol");
co_await _buffered_protocol->stop();
co_await _protocol->stop();
// release f_log pointer before stopping raft
Contributor


Any reason we do it before? It should not matter: it's a shared pointer, so the underlying object will only die when _raft is gone.
Same question about stopping the log before deleting raft, but that's out of scope for this PR.

src/v/raft/consensus.cc (review thread, resolved)
The `raft::reply_result::follower_busy` code indicates that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the reply from `timeout` makes it less confusing for the
reader and differentiates the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <[email protected]>
Wired the raft RPC service handler into the Raft fixture to make the
tests more accurate and cover the service code with tests.

Signed-off-by: Michał Maślanka <[email protected]>
Propagating timeouts to the node sending the RPC request is crucial for
accurately testing the Raft implementation.

Signed-off-by: Michał Maślanka <[email protected]>
Added a wrapper around `storage::log` that allows us to inject
storage-layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <[email protected]>
When a follower is busy, it may fail to process full heartbeat
requests sent by the leader quickly enough. In this case the follower's
RPC handler sets the `follower_busy` result in the heartbeat_reply. The
leader should still treat the follower replica as online in this case:
the node hosting the replica must be online to reply with the
`follower_busy` error.

This way we prevent overly eager leader step-downs when follower
replicas are slow.
Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from 592fc64 to 67e7c6e on December 23, 2024 08:03
@travisdowns
Member

Leader should still treat
a follower replica as online in this case. The replica hosting node must
be online to reply with the follower_busy error.

Right, and just to clarify: it is not exactly that the leader should treat the follower as "online" for the purposes of how it interacts with the follower (e.g., decisions about whether to send it RPC payloads), but rather that it should not consider itself isolated from that follower when making step-down decisions, right?

@@ -32,7 +32,7 @@ enum class reply_result : uint8_t {
success,
failure,
group_unavailable,
timeout
follower_busy
Member


I'm a bit confused, is the timeout case no longer possible?

Don't we have at least two cases: the follower replies immediately with "busy", and the follower never replies at all?

Member


I see, I guess the timeout case never ends up using the reply_result; it is handled on a different path: these codes are only for cases where an RPC was actually received, right?

Member Author


exactly

@mmaslankaprv mmaslankaprv merged commit 8543b66 into redpanda-data:dev Jan 2, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24590-v24.2.x-670 remotes/upstream/v24.2.x
git cherry-pick -x 6a1e34bead 95a29dba65 5f69d9b733 7d33bb5659 f04995a751 8b57b42101 67e7c6ea21

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24590-v24.3.x-322 remotes/upstream/v24.3.x
git cherry-pick -x 6a1e34bead 95a29dba65 5f69d9b733 7d33bb5659 f04995a751 8b57b42101 67e7c6ea21

Workflow run logs.
