Improve RemoteSequencer tracing and logging #2286
base: main

Conversation
Couple of comments:
So the problem was that the inflight commits channel was closed? If yes, how did this happen?
@tillrohrmann one case where this can happen is if a … Anyway, I am refactoring this code again. Consider this PR a draft for now.

And the sealed information wouldn't have propagated and eventually closed the …
The reason why I am asking about the exact failure scenario is that in the scenario I observed, the appends still completed. So I am wondering whether this doesn't indicate that the loglet wasn't sealed. Maybe you've observed a different scenario then?
You are right, I didn't observe a seal. But I was trying to find a scenario where it is possible for the channel to be closed while the connection is not. I added a few logs and then noticed a situation where that is exactly the case. But without the fix there was no attempt to reconnect, just repeated failures to send over the channel. This made me think it might be:
I think it would be great to understand the exact situation to learn from it (also for my personal closure ;-))
@tillrohrmann I totally understand your concern; I will try to reproduce it again without the fix but with enough logs. Unfortunately I lost the logs from the failed run.
/// If a LogletCommitResolver is dropped without being
/// 'resolved', we resolve it automatically as cancelled
/// to distinguish it from a Shutdown.
impl Drop for LogletCommitResolver {
I am not sure this is necessary, since if the LogletCommitResolver is dropped and the sender channel is closed, a RecvErr is received on the receiver side, which is then mapped into an AppendError::Shutdown.
I am wondering whether Shutdown is what we want to return here. This can still happen, for example, if a connection is lost in the case of the remote sequencer.
What do you think @AhmedSoliman?
Perhaps we can map the RecvErr to a more descriptive type, for instance CommitStatusUnknown, and mark it as a retryable append error?
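For illustration, a minimal sketch of such a mapping could look like the following; the CommitStatusUnknown name, the retryable flag, and the tokio oneshot channel are assumptions, not the existing crate API:

```rust
use tokio::sync::oneshot;

/// Hypothetical error type: returned when the resolver was dropped before
/// reporting a result, so the caller cannot know whether the record committed.
#[derive(Debug, Clone, Copy, thiserror::Error)]
#[error("commit status is unknown: the commit resolver was dropped before resolving")]
struct CommitStatusUnknown;

impl CommitStatusUnknown {
    /// Marking the error as retryable would let the appender retry the append
    /// instead of treating it as a terminal Shutdown.
    fn retryable(&self) -> bool {
        true
    }
}

/// Waits for the commit result and maps a closed channel (RecvError) to the
/// more descriptive error instead of a Shutdown.
async fn wait_for_commit(rx: oneshot::Receiver<u64>) -> Result<u64, CommitStatusUnknown> {
    rx.await.map_err(|_recv_err| CommitStatusUnknown)
}
```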
};

let appended = match appended {
let appended = match rpc_token.recv().await {
Possibly blocked forever if the response message never comes (lost)? Do you think it's better to time out here?
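A sketch of what a bounded wait could look like; the generic wrapper, the stand-in for `rpc_token.recv()`, and the five-second deadline are assumptions for illustration:

```rust
use std::future::Future;
use std::time::Duration;

use tokio::time::timeout;

/// Bounds an arbitrary response future with a deadline so a lost reply cannot
/// block the caller forever. `recv` stands in for `rpc_token.recv()`, and the
/// five-second deadline is a placeholder, not a tuned value.
async fn recv_with_deadline<T, E: std::fmt::Display>(
    recv: impl Future<Output = Result<T, E>>,
) -> Result<T, String> {
    match timeout(Duration::from_secs(5), recv).await {
        // The reply arrived within the deadline.
        Ok(Ok(reply)) => Ok(reply),
        // The reply channel failed (e.g. the connection was dropped).
        Ok(Err(err)) => Err(format!("recv failed: {err}")),
        // No reply within the deadline: the commit status is unknown.
        Err(_elapsed) => Err("no response within deadline".to_owned()),
    }
}
```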
My theory is that https://github.com/restatedev/restate/blob/891c350981f021f815b5b17fdba1dbfc37df4f9f/crates/bifrost/src/providers/replicated_loglet/remote_sequencer.rs#L243 solved the problem we were seeing. What could have caused the problem initially is the following: there is already a connection from node 0 to node 1 where the sequencer for log-1 is running. Now PP-0 on node 0 wants to append something to log-1 and therefore creates a … So maybe we revert this change in our tests to see whether we can reproduce the problem.
@tillrohrmann yep, that's exactly what was happening and that's why I fixed it. In this PR we should improve observability and figure out if there is a design change that makes this component more resilient, but I know that the actual root cause of the particular issue we saw is fixed by my change ;)
Alrighty. I was under the wrong impression that we are still trying to fix the root cause of the observed problem with this PR. Then I've reached my personal closure with this bug :-)
Only semi-related: runtime context inheritance can be quite tricky at times. Especially when components are shared across runtimes and one of them eventually gets stopped, this can lead to surprising effects.
Also avoid sending over the open connection if the inflight commits channel is closed
@tillrohrmann It's true, and there may be a way to make this more explicit, or we actually go to a thread-per-core design and fix the root of this rolling ball of issues.
@tillrohrmann tbh, without the enriched logs and an understanding of the 'inherited' runtime problem it was really hard for me to figure out why the task was aborted. The only thing I was sure of is that the task exited while the connection was still valid. So I built my solution to render the connection as disconnected if that ever happens again (for this reason or another). I also made sure that when this task is aborted, it actually does a graceful shutdown and drains all inflight commits. This was done implicitly before, of course, since when the CommitResolver is dropped it automatically resolves as Shutdown, so this improvement might not be necessary.
Left a couple of comments. In general, I don't think the current design of a task per connection will be reliable. I propose that we change the strategy and invest in RpcRouter providing the facility to resolve receive tokens when the underlying connection is dropped.
@@ -156,21 +156,40 @@ pub type SendableLogletReadStream = Pin<Box<dyn LogletReadStream + Send>>;

#[allow(dead_code)]
do we still need this?
#[derive(Debug, Clone, Copy, thiserror::Error)]
#[error("Commit resolver was dropped")]
struct CommitCancelled;
The term "cancelled" often indicates a graceful, user-requested cancellation of something; "abort", on the other hand, denotes an abrupt, ungraceful one.
That said, I'm not sure this error type adds enough context for the caller to understand why the commit was aborted. Do you have ideas on either:
A) making this error more precise, to pinpoint why the append was aborted, or
B) possibly removing the need for this error in lieu of an existing one?
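For option A, one hedged sketch of a more precise error would record the reason for the abort; the variant names here are invented and may not match how the networking layer reports these situations:

```rust
/// One possible shape for a more descriptive error (names are illustrative).
#[derive(Debug, Clone, Copy, thiserror::Error)]
enum AppendAborted {
    /// The connection to the remote sequencer was lost before a reply arrived.
    #[error("append aborted: connection to the sequencer was lost")]
    ConnectionLost,
    /// The response-handling task stopped (e.g. its runtime shut down).
    #[error("append aborted: the response handler task stopped")]
    HandlerStopped,
}
```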
let permits = self
    .record_permits
    .clone()
    .acquire_many_owned(len)
    .await
    .unwrap();
Idea: can we close() the semaphore when we detect that the loglet is sealed?
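A sketch of the idea, assuming the seal observation is what triggers the close; mapping the closed semaphore to "loglet sealed" is an assumption for illustration:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Once the seal is observed, closing the semaphore makes pending and future
/// acquires fail fast instead of blocking (or panicking on `unwrap()`).
async fn acquire_or_sealed(
    record_permits: Arc<Semaphore>,
    len: u32,
) -> Result<OwnedSemaphorePermit, &'static str> {
    record_permits
        .acquire_many_owned(len)
        .await
        // acquire_many_owned only errors if the semaphore was closed,
        // which in this sketch means the loglet was sealed.
        .map_err(|_closed| "loglet sealed")
}

/// Called when the seal is detected.
fn on_sealed(record_permits: &Semaphore) {
    record_permits.close();
}
```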
| NetworkError::ConnectionClosed(_)
| NetworkError::Timeout(_) => {
// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
We should never receive this error, since you are using an async (awaiting, back-pressuring) send on the channel.
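To illustrate the point, assuming the networking send maps onto a bounded tokio channel: an awaited `send` never reports "full", only `try_send` does:

```rust
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

/// `send(..).await` waits for capacity and only fails if the receiver is gone,
/// whereas `try_send` is what surfaces the capacity error; so matching on a
/// "full" error after an awaited send is effectively dead code.
async fn demo() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    tx.send(1).await.expect("receiver is alive"); // waits for capacity, never "full"
    assert!(matches!(tx.try_send(2), Err(TrySendError::Full(_))));
    assert_eq!(rx.recv().await, Some(1));
}
```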
// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
_ => {
// we retry on any other network error
I wonder if this is what we want. If the sequencer node died and the loglet was sealed, should we retry a few times (retry policy) and then have Bifrost's appender check whether the loglet was sealed / the metadata was updated before retrying again?
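A sketch of that suggestion: retry the reconnect a bounded number of times and re-check the seal/metadata between attempts. `connect` and `is_sealed` are placeholders for the real operations, and the backoff values are arbitrary:

```rust
use std::future::Future;
use std::time::Duration;

use tokio::time::sleep;

/// Bounded reconnect with an exponential backoff and a seal check between
/// attempts; on seal (or exhaustion) control goes back to the appender.
async fn reconnect_with_policy<F, Fut, E>(
    mut connect: F,
    mut is_sealed: impl FnMut() -> bool,
) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    let mut backoff = Duration::from_millis(100);
    let mut last_err = None;
    for _attempt in 0..3 {
        match connect().await {
            Ok(()) => return Ok(()),
            Err(err) => {
                // If the loglet got sealed in the meantime, stop retrying and
                // let the appender re-check the metadata before trying again.
                if is_sealed() {
                    return Err(err);
                }
                last_err = Some(err);
            }
        }
        sleep(backoff).await;
        backoff *= 2;
    }
    Err(last_err.expect("at least one attempt was made"))
}
```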
let connection = self
    .networking
    .node_connection(self.params.sequencer)
    .await?;

Span::current().record("renewed", true);
I'm not sure this really adds value. Let's remove it. The log message should be sufficient.
}

/// Gets or starts a new remote sequencer connection
#[instrument(level = "debug", skip_all)]
The span doesn't add any value here since it has no context to add
@@ -228,26 +233,32 @@ where
#[derive(Clone)]
struct RemoteSequencerConnection {
    inner: WeakConnection,
    tx: mpsc::UnboundedSender<RemoteInflightAppend>,
    inflight: mpsc::Sender<RemoteInflightAppend>,
I'm not entirely convinced that this is a good design in general. A different approach could be to have one task associated with the remote sequencer handle all responses, and to refactor RpcRouter so that it resolves the receive token when the connection is dropped.
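A very rough sketch of the proposed shape, with all names invented for illustration (the real RpcRouter/networking types look different): one task per remote sequencer owns the pending appends and resolves them either from incoming responses or, when the connection drops, with an error.

```rust
use std::collections::HashMap;

use tokio::sync::{mpsc, oneshot};

/// Events the per-sequencer response task reacts to (illustrative only).
enum Event {
    /// A response for the append registered under `id` arrived.
    Response { id: u64, offset: u64 },
    /// The underlying connection was dropped.
    ConnectionClosed,
}

async fn response_handler(
    mut events: mpsc::Receiver<Event>,
    mut pending: HashMap<u64, oneshot::Sender<Result<u64, &'static str>>>,
) {
    while let Some(event) = events.recv().await {
        match event {
            Event::Response { id, offset } => {
                if let Some(tx) = pending.remove(&id) {
                    let _ = tx.send(Ok(offset));
                }
            }
            Event::ConnectionClosed => {
                // Resolve every in-flight append instead of leaving callers
                // blocked on a receive token that will never be fulfilled.
                for (_, tx) in pending.drain() {
                    let _ = tx.send(Err("connection to the sequencer was lost"));
                }
                return;
            }
        }
    }
}
```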
Thank you @AhmedSoliman for your review. I will go over and process all comments asap.