Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let PartitionProcessors failures propagate to the PartitionProcessorManager #2214

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

muhamadazmy
Copy link
Contributor

@muhamadazmy muhamadazmy commented Nov 5, 2024

Let PartitionProcessors failures propagate to the PartitionProcessorManager

Summary:
When PP crashes (returns an error) we should update it's status instead
of panicing

Fixes #2147


Stack created with Sapling. Best reviewed with ReviewStack.

@muhamadazmy
Copy link
Contributor Author

This change prevents a failure of a PP from bringing down the entire runtime, instead the error is propagated to the PPM and the status of the PP is basically popped out from the inner map. This will then show up on the CC next status pull and can then decide to take an action.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @muhamadazmy. The changes look good to me. I had one comment regarding the clean shut down case where a Stopped event could also be helpful. +1 for merging after resolving the comments.

@@ -648,6 +652,14 @@ impl<T: TransportConnect> PartitionProcessorManager<T> {
error!(%partition_id, error=%err, "Starting partition processor failed");
self.running_partition_processors.remove(&partition_id);
}
ProcessorEvent::Stopped(err) => {
error!(%partition_id, error=%err, "Partition processor exited");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe warn because the system is not yet broken (hopefully). warn!(%partition_id, "Partition processor exited unexpectedly: {err}");

let _ = events
.send(ManagerEvent {
partition_id,
event: ProcessorEvent::Stopped(err),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would probably also return a Stopped event if the partition processor exited w/o an error. That way the receiver can also clean up its data structures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. Good catch!

…anager

Summary:
When PP crashes (returns an error) we should update it's status instead
of panicing

Fixes restatedev#2147
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging.

@muhamadazmy muhamadazmy merged commit 2f00d4c into restatedev:main Nov 8, 2024
17 of 18 checks passed
@muhamadazmy muhamadazmy deleted the pr2214 branch November 8, 2024 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Let PartitionProcessors failures propagate to the PartitionProcessorManager
2 participants