Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce retry-policy to shuffle #2228

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

muhamadazmy
Copy link
Contributor

@muhamadazmy muhamadazmy commented Nov 7, 2024

Introduce retry-policy to shuffle

Fixes #2148

/// # Append retry policy
///
/// Retry policy for appending records to virtual log (bifrost)
pub append_retry_policy: RetryPolicy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why we shouldn't use bifrost's append_retry_policy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly wasn't sure if the bifrost append retry policy should be reused here. In my mind the two polices are not related since one of them is internal to bifrost operation, while this one is from a perspective of a bifrost user (the shuffler in this case)

That being said, I am totally fine to drop this one and reuse the one from bifrost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we are already using it once we call into the Bifrost::append method, right?

I've introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also say that we don't retry outside of the bifrost retries and fail the shuffle and thereby the pp if it fails to append entries. Then it would be the responsibility of the PPM and the CC to decide whether to restart the PP or not. Or we retry a few times in the shuffle and only then give up.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @muhamadazmy. I left a few comments and suggestions how we could handle a finite number of retries in the shuffle.

/// # Append retry policy
///
/// Retry policy for appending records to virtual log (bifrost)
pub append_retry_policy: RetryPolicy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we are already using it once we call into the Bifrost::append method, right?

I've introduced the infinite retry policy in the shuffle for the demo. Since we can now handle partition processor errors in the partition processor manager, we could also say that we don't retry outside of the bifrost retries and fail the shuffle and thereby the pp if it fails to append entries. Then it would be the responsibility of the PPM and the CC to decide whether to restart the PP or not. Or we retry a few times in the shuffle and only then give up.

tokio::time::sleep(delay).await;
}
None => {
return Err(err).context("Maximum number of retries exhausted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support finite restart strategies, then we need to change the following things: Make sure that we are not running in a TaskKind that panics on errors and we need to manage the task so that the owner (partition processor) learns about the failed task and can react to it (e.g. in the simplest case propagating it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, The only reason I thought this is okay was because this function can return Result. But indeed as you said this will cause eventually a Shutdown of the node.

But I am wondering now if the shuffler should actually implement a retry at all after your comment and the fact that bifrost will retry forever

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce retry-policy to shuffle
3 participants