Introduce retry-policy to shuffle #2228

muhamadazmy · 2024-11-07T10:01:02Z

Introduce retry-policy to shuffle

AhmedSoliman · 2024-11-13T14:41:05Z

crates/types/src/config/worker.rs

+    /// # Append retry policy
+    ///
+    /// Retry policy for appending records to virtual log (bifrost)
+    pub append_retry_policy: RetryPolicy,


Any reason why we shouldn't use bifrost's append_retry_policy?

muhamadazmy · Nov 14, 2024

I honestly wasn't sure whether the bifrost append retry policy should be reused here. In my mind the two policies are not related: one of them is internal to bifrost's operation, while this one is from the perspective of a bifrost user (the shuffler in this case).

That being said, I am totally fine to drop this one and reuse the one from bifrost.

tillrohrmann · Nov 15, 2024

Technically, we are already using it once we call into the Bifrost::append method, right?

I introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also decide not to retry outside of the bifrost retries and instead fail the shuffle, and thereby the PP, if it fails to append entries. It would then be the responsibility of the PPM and the CC to decide whether to restart the PP. Alternatively, we retry a few times in the shuffle and only then give up.

tillrohrmann

Thanks for creating this PR @muhamadazmy. I left a few comments and suggestions on how we could handle a finite number of retries in the shuffle.

tillrohrmann · 2024-11-15T14:39:24Z

crates/types/src/config/worker.rs

+    /// # Append retry policy
+    ///
+    /// Retry policy for appending records to virtual log (bifrost)
+    pub append_retry_policy: RetryPolicy,


Technically, we are already using it once we call into the Bifrost::append method, right?

I introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also decide not to retry outside of the bifrost retries and instead fail the shuffle, and thereby the PP, if it fails to append entries. It would then be the responsibility of the PPM and the CC to decide whether to restart the PP. Alternatively, we retry a few times in the shuffle and only then give up.
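
For the last option (retry a few times in the shuffle, then give up), a minimal sketch could look like the following. It is not the PR's code: `append_with_retries` is a hypothetical helper, the delays are a plain iterator standing in for `RetryPolicy`, and it blocks with `std::thread::sleep` where the shuffle would await `tokio::time::sleep`.

```rust
use std::fmt::Debug;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not part of the PR): run `op` and, on failure,
/// wait for the next delay yielded by `delays` before trying again.
/// Once the delays are exhausted, the last error is returned to the caller
/// so that the owner of the shuffle can decide what to do with it.
fn append_with_retries<T, E: Debug>(
    mut op: impl FnMut() -> Result<T, E>,
    delays: impl IntoIterator<Item = Duration>,
) -> Result<T, E> {
    let mut delays = delays.into_iter();
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => match delays.next() {
                // Another attempt is allowed: back off, then loop again.
                Some(delay) => {
                    eprintln!("append failed: {err:?}, retrying in {delay:?}");
                    sleep(delay);
                }
                // Finite policy exhausted: give up and surface the error.
                None => return Err(err),
            },
        }
    }
}
```

With an empty delay iterator the first error is returned immediately, which corresponds to relying only on the retries that already happen inside bifrost.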

tillrohrmann · 2024-11-15T14:40:52Z

crates/worker/src/partition/shuffle.rs

+                                    tokio::time::sleep(delay).await;
+                                }
+                                None => {
+                                    return Err(err).context("Maximum number of retries exhausted");


If we want to support finite restart strategies, then we need to change two things. First, we must make sure that we are not running in a TaskKind that panics on errors. Second, we need to manage the task so that the owner (the partition processor) learns about the failed task and can react to it (e.g. in the simplest case by propagating it).

muhamadazmy · Nov 20, 2024

Hmm, the only reason I thought this was okay was that this function can return a Result. But indeed, as you said, this will eventually cause a shutdown of the node.

Given your comment and the fact that bifrost will retry forever, I am now wondering whether the shuffler should implement a retry at all.
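
The task-management point raised above (the owner has to learn about a failed task instead of the failure taking down the node) can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain `std::thread` in place of the project's task management, and `TaskFailure` and `run_owned_task` are hypothetical names that do not exist in the codebase.

```rust
use std::thread;

/// Hypothetical outcome type: how an owned task can end unsuccessfully.
#[derive(Debug, PartialEq)]
enum TaskFailure {
    /// The task returned an error, e.g. because its retries were exhausted.
    Failed(String),
    /// The task panicked instead of returning.
    Panicked,
}

/// Hypothetical owner-side helper: run `task` on its own thread, wait for
/// it, and report the outcome to the caller instead of aborting the process.
/// The caller (the partition processor in the discussion) can then react,
/// in the simplest case by propagating the failure further up.
fn run_owned_task<F>(task: F) -> Result<(), TaskFailure>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let handle = thread::spawn(task);
    match handle.join() {
        // The task ran to completion without an error.
        Ok(Ok(())) => Ok(()),
        // The task gave up and returned an error to its owner.
        Ok(Err(reason)) => Err(TaskFailure::Failed(reason)),
        // `join` returns `Err` when the spawned thread panicked.
        Err(_panic) => Err(TaskFailure::Panicked),
    }
}
```

The design choice here is that an exhausted retry budget becomes an ordinary value returned to the owner, so the restart decision stays with whoever manages the task.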

This was referenced Nov 7, 2024

Let PartitionProcessors failures propagate to the PartitionProcessorManager #2214

Merged

Avoid blocking in the PartitionProcessorManager when being in the control loop #2179

Merged

muhamadazmy force-pushed the pr2228 branch 2 times, most recently from 94ee2a2 to 2d01133 Compare November 7, 2024 10:03

muhamadazmy marked this pull request as ready for review November 7, 2024 12:14

muhamadazmy requested a review from tillrohrmann November 7, 2024 12:14

muhamadazmy force-pushed the pr2228 branch from 2d01133 to 0aa9e7a Compare November 7, 2024 14:54

Introduce retry-policy to shuffle

08523b5

Fixes restatedev#2148

muhamadazmy force-pushed the pr2228 branch from 0aa9e7a to 08523b5 Compare November 8, 2024 08:16

AhmedSoliman reviewed Nov 13, 2024


tillrohrmann reviewed Nov 15, 2024

