-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce retry-policy to shuffle #2228
base: main
Are you sure you want to change the base?
Conversation
94ee2a2
to
2d01133
Compare
/// # Append retry policy | ||
/// | ||
/// Retry policy for appending records to virtual log (bifrost) | ||
pub append_retry_policy: RetryPolicy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why we shouldn't use bifrost's append_retry_policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly wasn't sure if the bifrost append retry policy should be reused here. In my mind the two polices are not related since one of them is internal to bifrost operation, while this one is from a perspective of a bifrost user (the shuffler in this case)
That being said, I am totally fine to drop this one and reuse the one from bifrost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we are already using it once we call into the Bifrost::append method, right?
I've introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also say that we don't retry outside of the bifrost retries and fail the shuffle and thereby the pp if it fails to append entries. Then it would be the responsibility of the PPM and the CC to decide whether to restart the PP or not. Or we retry a few times in the shuffle and only then give up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this PR @muhamadazmy. I left a few comments and suggestions how we could handle a finite number of retries in the shuffle.
/// # Append retry policy | ||
/// | ||
/// Retry policy for appending records to virtual log (bifrost) | ||
pub append_retry_policy: RetryPolicy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we are already using it once we call into the Bifrost::append method, right?
I've introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also say that we don't retry outside of the bifrost retries and fail the shuffle and thereby the pp if it fails to append entries. Then it would be the responsibility of the PPM and the CC to decide whether to restart the PP or not. Or we retry a few times in the shuffle and only then give up.
tokio::time::sleep(delay).await; | ||
} | ||
None => { | ||
return Err(err).context("Maximum number of retries exhausted"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to support finite restart strategies, then we need to change the following things: Make sure that we are not running in a TaskKind
that panics on errors and we need to manage the task so that the owner (partition processor) learns about the failed task and can react to it (e.g. in the simplest case propagating it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, The only reason I thought this is okay was because this function can return Result
. But indeed as you said this will cause eventually a Shutdown of the node.
But I am wondering now if the shuffler should actually implement a retry at all after your comment and the fact that bifrost will retry forever
Introduce retry-policy to shuffle
Fixes #2148