
Improve RemoteSequencer tracing and logging #2286

Open · wants to merge 1 commit into base: main
Conversation

@muhamadazmy (Contributor) commented Nov 13, 2024:

Improve RemoteSequencer tracing and logging

Also avoid sending over the open connection if the inflight commits channel is closed.

@muhamadazmy (author) commented Nov 13, 2024:

A couple of comments:

  • It took multiple iterations to actually land on the remote sequencer issue. After this fix I was not able to reproduce the same issue, but I did get other failures, which I will add to the main issue.
  • I am wondering if the "remote-sequencer-connection" task should have a TaskKind::SequencerAppender and then be shut down gracefully, to have a chance to process pending commit tokens. (I will update the PR.)

@tillrohrmann (Contributor) commented:

So the problem was that the inflight commits channel was closed? If yes, how did this happen?

@muhamadazmy (author) commented:

@tillrohrmann one case where this can happen is if a sealed response is received: this breaks the loop, the channel is then drained, and all inflight commits are resolved as 'sealed'. But at the same time the connection is still valid, so sending the next append will still work, while pushing to the inflight channel will fail.

Anyway, I am refactoring this code again. Consider this PR a draft for now.
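
To illustrate the shape of that failure mode, here is a minimal, self-contained tokio sketch (not the actual RemoteSequencer code): the response-handling loop exits on a 'sealed' response and drops the receiver, after which the channel is closed even though nothing is wrong with the connection itself.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (inflight_tx, mut inflight_rx) = mpsc::channel::<&str>(16);

    // Response-handling loop: breaks as soon as it sees a "sealed" response,
    // which drops the receiver and closes the channel.
    let handler = tokio::spawn(async move {
        while let Some(resp) = inflight_rx.recv().await {
            if resp == "sealed" {
                // in the real component the channel would now be drained and
                // all remaining inflight commits resolved as 'sealed'
                break;
            }
        }
    });

    inflight_tx.send("sealed").await.unwrap();
    handler.await.unwrap();

    // The underlying network connection is untouched, but pushing the next
    // inflight append into the channel now fails.
    assert!(inflight_tx.is_closed());
    assert!(inflight_tx.send("next append").await.is_err());
}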

@tillrohrmann (Contributor) commented:


And the sealed information wouldn't have propagated and eventually closed the RemoteSequencer? Like here: if self.known_global_tail().is_sealed() { ... }?

@tillrohrmann (Contributor) commented:

The reason why I am asking about the exact failure scenario is that in the scenario I observed, the appends still completed. So I am wondering whether this indicates that the loglet wasn't sealed. Maybe you've observed a different scenario, then?

@muhamadazmy (author) commented:

You are right. I didn't observe a seal. But I was trying to find a scenario where it is possible for the channel to be closed while the connection is not.

I added a few logs and then noticed a situation where that is exactly the case. But there was no attempt to reconnect, just repeated failures to send over the channel. This made me think the connection might be returning a different error on failure to send, so I:

  • expanded the list of send errors on which we retry,
  • changed the task kind to a different type (to basically wait for the drain),
  • made sure the channel is checked before attempting to send (see the sketch below).
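
A rough sketch of that last point, with placeholder type names (this is an assumption, not the PR's actual code): treat a closed inflight channel the same as a dead connection, so the caller renews it instead of failing repeatedly.

use tokio::sync::mpsc;

struct RemoteInflightAppend; // stand-in for the real type

#[derive(Debug, thiserror::Error)]
#[error("connection (or its inflight channel) is no longer usable")]
struct ConnectionGone;

// Check the channel before attempting to send; a closed inflight channel is
// surfaced as a retryable "connection gone" error so the caller reconnects.
async fn enqueue_inflight(
    inflight: &mpsc::Sender<RemoteInflightAppend>,
    append: RemoteInflightAppend,
) -> Result<(), ConnectionGone> {
    if inflight.is_closed() {
        return Err(ConnectionGone);
    }
    inflight.send(append).await.map_err(|_| ConnectionGone)
}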

@tillrohrmann (Contributor) commented:

I think it would be great to understand the exact situation to learn from it (also for my personal closure ;-))

@muhamadazmy (author) commented:

@tillrohrmann I totally understand your concern. I will try to reproduce it again without the fix but with enough logs. Unfortunately, I lost the logs from the failed run.

/// If a LogletCommitResolver is dropped without being 'resolved',
/// we resolve it automatically as cancelled to distinguish it from a Shutdown.
impl Drop for LogletCommitResolver {
@muhamadazmy (author) commented:

I am not sure this is necessary, since if the LogletCommitResolver is dropped and the sender channel is closed, a RecvErr is received on the receiver side, which is then mapped into an AppendError::Shutdown.

I am wondering if a Shutdown is what we need to return here. This can still happen, for example, if a connection is lost in the case of a remote sequencer.

What do you think, @AhmedSoliman?

@AhmedSoliman (Contributor) commented:

Perhaps we can map the RecvErr to a more descriptive type, for instance CommitStatusUnknown, and mark it as a retryable append error?
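
A small sketch of that suggestion, with assumed names (CommitStatusUnknown and the retryable() helper are not existing restate APIs): map the closed oneshot into a dedicated, retryable variant instead of Shutdown.

use tokio::sync::oneshot;

struct Appended; // stand-in for the real commit result

#[derive(Debug, thiserror::Error)]
enum AppendError {
    #[error("commit status unknown: the resolver was dropped before resolving")]
    CommitStatusUnknown,
    #[error("system is shutting down")]
    Shutdown,
}

impl AppendError {
    fn retryable(&self) -> bool {
        matches!(self, AppendError::CommitStatusUnknown)
    }
}

// Waiting for the commit: a dropped LogletCommitResolver closes the oneshot,
// which surfaces here as a RecvError and is mapped to CommitStatusUnknown
// rather than to a (possibly misleading) Shutdown.
async fn wait_for_commit(rx: oneshot::Receiver<Appended>) -> Result<Appended, AppendError> {
    rx.await.map_err(|_| AppendError::CommitStatusUnknown)
}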

};

- let appended = match appended {
+ let appended = match rpc_token.recv().await {
@muhamadazmy (author) commented Nov 14, 2024:

Could this block forever if the response message never arrives (e.g. it is lost)? Do you think it's better to time out here?

@tillrohrmann @AhmedSoliman
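
A hedged sketch of the timeout idea (the deadline value and the names are assumptions): bound the wait on the rpc token so a lost response surfaces as an error instead of blocking forever.

use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

struct Appended; // stand-in for the real response type

#[derive(Debug, thiserror::Error)]
enum WaitError {
    #[error("no appended response within {0:?}; commit status unknown")]
    Timeout(Duration),
    #[error("response channel closed")]
    Closed,
}

// Bound the wait on the rpc token instead of awaiting it indefinitely.
async fn recv_appended(rx: oneshot::Receiver<Appended>) -> Result<Appended, WaitError> {
    const DEADLINE: Duration = Duration::from_secs(5);
    match timeout(DEADLINE, rx).await {
        Ok(Ok(appended)) => Ok(appended),
        Ok(Err(_recv_err)) => Err(WaitError::Closed),
        Err(_elapsed) => Err(WaitError::Timeout(DEADLINE)),
    }
}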

@tillrohrmann (Contributor) commented:

My theory is that https://github.com/restatedev/restate/blob/891c350981f021f815b5b17fdba1dbfc37df4f9f/crates/bifrost/src/providers/replicated_loglet/remote_sequencer.rs#L243 solved the problem we were seeing.

What could have caused the problem initially is the following: there is already a connection from node 0 to node 1, where the sequencer for log-1 is running. Now PP-0 on node 0 wants to append something to log-1 and therefore creates a RemoteSequencer for this loglet. It will reuse the existing connection but spawn handle_appended_responses on its own runtime, pp-0. Now PP-0 gets shut down, which also stops the handle_appended_responses task. Then PP-0 gets restarted and wants to append to log-1 yet again. It will reuse the existing RemoteSequencer, which still has a valid connection assigned but no longer a handle_appended_responses task. Note that this wouldn't have been a problem if the connection reactor had also been running on pp-0, because then it would have been terminated as well.

So maybe we should revert this change in our tests to see whether we can reproduce the problem.

@AhmedSoliman (Contributor) commented:

@tillrohrmann yep, that's exactly what was happening and that's why I fixed it.

In this PR we should improve observability and figure out if there is a design change that makes this component more resilient, but I know that the actual root cause of the particular issue we saw is fixed by my change ;)

@tillrohrmann (Contributor) commented:


Alrighty. I was under the wrong impression that we were still trying to fix the root cause of the observed problem with this PR. Then I've reached my personal closure with this bug :-)

@tillrohrmann (Contributor) commented:

Only semi-related: runtime context inheritance can be quite tricky at times. Especially when components are shared across runtimes and one of them eventually gets stopped, this can lead to surprising effects.
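
A generic tokio sketch of that effect (not restate code; the runtime and channel names are made up): a task spawned on one runtime silently disappears when that runtime shuts down, while the shared resource it was servicing lives on.

use tokio::{runtime::Runtime, sync::mpsc};

fn main() {
    // Stand-in for a per-partition runtime like "pp-0".
    let pp0 = Runtime::new().unwrap();

    // Stand-in for a shared connection that lives independently of pp-0.
    let (conn_tx, mut conn_rx) = mpsc::unbounded_channel::<&str>();

    // Response handler spawned on pp-0's runtime, like handle_appended_responses.
    pp0.spawn(async move {
        while let Some(resp) = conn_rx.recv().await {
            println!("handled response: {resp}");
        }
    });

    // pp-0 is shut down (e.g. the partition processor restarts); the handler
    // task is dropped with it.
    pp0.shutdown_background();

    // The shared "connection" is still usable for sending, but nothing will
    // ever drain its responses again.
    let _ = conn_tx.send("appended");
}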

@AhmedSoliman (Contributor) commented:

@tillrohrmann It's true, and there may be a way to make this more explicit, or we actually go for a thread-per-core design and fix the root of this rolling ball of issues.

@muhamadazmy (author) commented:

@tillrohrmann tbh, without the enriched logs and an understanding of the 'inherited' runtime problem, it was really hard for me to figure out why the task was aborted. The only thing I was sure of is that the task exited while the connection was still valid, so I built my solution to render the connection as disconnected if that ever happened again (for this reason or another).

I also made sure that when this task is aborted, it actually does a graceful shutdown and drains all inflight commits. This was of course already happening implicitly, since when the CommitResolver is dropped it automatically resolves as Shutdown, so this improvement might not be necessary.

@AhmedSoliman (Contributor) left a review:

Left a couple of comments. In general, I don't think the current design of a task per connection will be reliable. I propose that we change the strategy and invest in having RpcRouter provide the facility to resolve receive tokens when the underlying connection is dropped.

@@ -156,21 +156,40 @@ pub type SendableLogletReadStream = Pin<Box<dyn LogletReadStream + Send>>;

#[allow(dead_code)]
@AhmedSoliman commented:

do we still need this?


#[derive(Debug, Clone, Copy, thiserror::Error)]
#[error("Commit resolver was dropped")]
struct CommitCancelled;
@AhmedSoliman commented:

The term "cancelled" often indicates a graceful user-requested cancellation of something. Abort on the other hand is used to denote an abrupt ungraceful one.

That said, I'm not sure if this error type adds enough context to the caller to understand why the commit was aborted. Do you have ideas on how either:
A) Making this error more precise to pin point why the append was aborted
B) Possibly remove the need for this error in lieu of an existing one?

Comment on lines +138 to +143
let permits = self
.record_permits
.clone()
.acquire_many_owned(len)
.await
.unwrap();
@AhmedSoliman commented:

Idea: can we close() the semaphore when we detect that the loglet is sealed?
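
A sketch of that idea using tokio's Semaphore::close() (placeholder names; error handling is simplified): once the semaphore is closed, acquire_many_owned returns an error instead of waiting, so appends fail fast after a seal.

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

#[derive(Debug, thiserror::Error)]
#[error("loglet is sealed; no more record permits will be granted")]
struct Sealed;

// Acquiring permits: after close(), this returns Err(Sealed) instead of
// blocking on a semaphore that will never be refilled.
async fn acquire_record_permits(
    record_permits: Arc<Semaphore>,
    len: u32,
) -> Result<OwnedSemaphorePermit, Sealed> {
    record_permits
        .acquire_many_owned(len)
        .await
        .map_err(|_| Sealed)
}

// Elsewhere, on observing the seal:
fn on_sealed(record_permits: &Semaphore) {
    // Wakes every pending acquire with an AcquireError.
    record_permits.close();
}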

| NetworkError::ConnectionClosed(_)
| NetworkError::Timeout(_) => {
// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
@AhmedSoliman commented:

We should never receive this error, since you are using an async (blocking) send on the channel.

// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
_ => {
// we retry on any other network error
@AhmedSoliman commented:

I wonder if this is what we want. If the sequencer node died and the loglet was sealed, should we retry a few times (with a retry policy) and then have Bifrost's appender check whether the loglet was sealed / the metadata was updated before retrying again?
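
A rough sketch of that flow (the retry budget and the seal/metadata check hook are assumptions, not existing APIs): retry the send a bounded number of times, then return the error so the appender can re-check the seal or refresh metadata before trying again.

use std::{future::Future, time::Duration};

// Retry an async operation a bounded number of times with a fixed backoff;
// once the budget is exhausted, hand the error back to the caller (e.g.
// Bifrost's appender), which can check the seal / metadata first.
async fn with_bounded_retries<F, Fut, T, E>(
    mut attempt: F,
    max_attempts: u32,
    backoff: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_attempts {
        match attempt().await {
            Ok(value) => return Ok(value),
            Err(err) => {
                last_err = Some(err);
                tokio::time::sleep(backoff).await;
            }
        }
    }
    Err(last_err.expect("max_attempts must be > 0"))
}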

let connection = self
.networking
.node_connection(self.params.sequencer)
.await?;

Span::current().record("renewed", true);
@AhmedSoliman commented:

I'm not sure this really adds value. Let's remove it. The log message should be sufficient.

}

/// Gets or starts a new remote sequencer connection
#[instrument(level = "debug", skip_all)]
@AhmedSoliman commented:

The span doesn't add any value here since it has no context to add.

@@ -228,26 +233,32 @@ where
#[derive(Clone)]
struct RemoteSequencerConnection {
inner: WeakConnection,
- tx: mpsc::UnboundedSender<RemoteInflightAppend>,
+ inflight: mpsc::Sender<RemoteInflightAppend>,
@AhmedSoliman commented:

I'm not entirely convinced that this is a good design in general. A different approach could be to have one task associated with the remote sequencer handle all responses, and to refactor RpcRouter to resolve the receive token when the connection is dropped.

@muhamadazmy (author) commented:

Thank you @AhmedSoliman for your review. I will go over and process all comments asap.
