
Gaps in mt_events sequence causing subscriptions to be paused #3503

Open
I4QvViGzUN0F opened this issue Oct 23, 2024 · 5 comments

@I4QvViGzUN0F

We experienced a problem this morning where unavoidable maintenance on our Azure Postgres database caused a gap in the seq_id values in the mt_events table. As a result, the high water mark never made it above the gap and the subscriptions were paused.

[Image: screenshot of the mt_events table showing the seq_id values around the gap]

As the image shows, there is a gap between 30041 and 30051.
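
In case it helps anyone else, the gap itself can be found with a window-function query rather than by eyeballing the table. This is only a sketch, assuming the default Marten events table in the quality schema with a seq_id column:

```sql
-- Find spots where seq_id jumps by more than 1, i.e. gaps in the event sequence.
-- Assumes the default Marten events table in the "quality" schema with a seq_id column.
SELECT prev_seq_id AS gap_starts_after,
       seq_id      AS resumes_at
FROM (
    SELECT seq_id,
           lag(seq_id) OVER (ORDER BY seq_id) AS prev_seq_id
    FROM quality.mt_events
) AS s
WHERE seq_id - prev_seq_id > 1;
```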

The logs around this time show a connection issue to the postgres database between 06:01:38 and 06:02:36, followed by an indication that a gap was detected. But when we checked mt_event_progression 6 hours later, we found that the HighWaterMark and all of the subscriptions were stuck at 30041. They didn't proceed from 30051 up to the latest sequence id, 31382, as expected.

2024-10-23 06:02:38.174 +00:00 [INF] Daemon projection high water detection skipping a gap in event sequence, determined that the 'safe harbor' sequence is at null | { SourceContext: "Marten.Events.Daemon.Coordination.ProjectionCoordinator", MachineName: "aio-quality-eventsubscription-api-7795cf5ffc-7w452" } |

2024-10-23 06:02:38.171 +00:00 [INF] High Water agent is stale after threshold of 3 seconds, skipping gap to events marked after "2024-10-23T06:01:40.2499450+00:00" for database Marten | { SourceContext: "Marten.Events.Daemon.Coordination.ProjectionCoordinator", MachineName: "aio-quality-eventsubscription-api-7795cf5ffc-7w452" } |

2024-10-23 06:02:38.171 +00:00 [INF] High Water agent is stale at 30041 for database Marten | { SourceContext: "Marten.Events.Daemon.Coordination.ProjectionCoordinator", MachineName: "aio-quality-eventsubscription-api-7795cf5ffc-7w452" } |

2024-10-23 06:02:36.683 +00:00 [ERR] Failed while trying to detect high water statistics for database Marten | { SourceContext: "Marten.Events.Daemon.Coordination.ProjectionCoordinator", MachineName: "aio-quality-eventsubscription-api-7795cf5ffc-7w452" } |
Npgsql.NpgsqlException (0x80004005): Failed to connect to xxx.xxx.xxx.xxx
 ---> System.Net.Sockets.SocketException (111): Connection refused
 at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
 at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
 at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
 at Npgsql.TaskTimeoutAndCancellation.ExecuteAsync(Func`2 getTaskFunc, NpgsqlTimeout timeout, CancellationToken cancellationToken)
 at Npgsql.Internal.NpgsqlConnector.ConnectAsync(NpgsqlTimeout timeout, CancellationToken cancellationToken)
 at Npgsql.Internal.NpgsqlConnector.ConnectAsync(NpgsqlTimeout timeout, CancellationToken cancellationToken)
 at Npgsql.Internal.NpgsqlConnector.RawOpen(SslMode sslMode, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken, Boolean isFirstAttempt)
 at Npgsql.Internal.NpgsqlConnector.<Open>g__OpenCore|213_1(NpgsqlConnector conn, SslMode sslMode, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken, Boolean isFirstAttempt)
 at Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
 at Npgsql.PoolingDataSource.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
 at Npgsql.PoolingDataSource.<Get>g__RentAsync|34_0(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
 at Npgsql.NpgsqlConnection.<Open>g__OpenAsync|42_0(Boolean async, CancellationToken cancellationToken)
 at Marten.Storage.MartenDatabase.SingleQuery`1.ExecuteAsync(CancellationToken cancellation)
 at Marten.Storage.MartenDatabase.SingleQuery`1.ExecuteAsync(CancellationToken cancellation)
 at Polly.ResiliencePipeline.<>c__9`2.<<ExecuteAsync>b__9_0>d.MoveNext()
--- End of stack trace from previous location ---
 at Polly.Outcome`1.GetResultOrRethrow()
 at Polly.ResiliencePipeline.ExecuteAsync[TResult,TState](Func`3 callback, TState state, CancellationToken cancellationToken)
 at Marten.Events.Daemon.HighWater.HighWaterDetector.loadCurrentStatistics(CancellationToken token)
 at Marten.Events.Daemon.HighWater.HighWaterDetector.Detect(CancellationToken token)
 at Marten.Events.Daemon.HighWater.HighWaterAgent.detectChanges()

To fix this we had to manually set the HighWaterMark to 31382 and all of the subscriptions to 30051, and then restart the async daemon. Is it intended that the async daemon should gracefully recover from such gaps in the sequence? It seems inevitable in our case that these will happen.
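
For anyone hitting the same thing, the manual fix boiled down to updating the progression table directly and then restarting the daemon. Roughly the following, as a sketch only (the 'HighWaterMark' row name and the name/last_seq_id columns are assumptions based on the default Marten schema, so check your own table first):

```sql
-- Inspect where everything is stuck (column names assumed from the default Marten schema).
SELECT name, last_seq_id FROM quality.mt_event_progression ORDER BY name;

-- Push the high water mark past the gap, up to the latest sequence id.
UPDATE quality.mt_event_progression
SET last_seq_id = 31382
WHERE name = 'HighWaterMark';

-- Move the stuck subscriptions/projections to the first sequence id after the gap.
UPDATE quality.mt_event_progression
SET last_seq_id = 30051
WHERE name <> 'HighWaterMark';
```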

@jeremydmiller (Member) commented Oct 23, 2024

" Is it intended that the async daemon should gracefully recover from such gaps in the sequence because it seems inevitable in our case that these will happen?"

Yes, but if the app can't connect to postgres -- which, judging from your error messages, it was unable to do -- there's nothing that Marten could do.

What are you wanting to happen here? I can't say that there are many reports of this behavior. This might be just the 2nd, and in the first case, I know there was a lot of other wacky stuff going on too.

"To fix this we had to manually set the HighWaterMark to 31382 and all the subscriptions to 30051 and then restart the async daemon." -- try not to mess with that table if you can help it. I would have thought that restarting the app would have allowed the high water mark to skip over the gaps after a little bit of time.

Also, you might look into the "quick append" option for the event store to avoid having so many gaps in the first place. That might also help.

@jeremydmiller (Member)

Also, did Marten keep logging connectivity errors, or does it appear that the high water mark detection just stopped in this case (which it most certainly should never do)?

@I4QvViGzUN0F (Author)

The problem is that even after the connection to postgres has been restored (the maintenance typically lasts 1-2 minutes), the async daemon does not detect the new high water mark until restarted.

Also, did Marten keep logging connectivity errors, or does it appear that the high water mark detection just stopped in this case (which it most certainly should never do)?

It just seems to stop. There are no log lines after the 06:02:38.174 one posted above.

I just tested locally by stopping postgres running in Docker and issuing ALTER SEQUENCE quality.mt_events_sequence INCREMENT BY 10, and it works as expected: when postgres comes back up, the async daemon continues as normal. The difference is in these two log lines. The first one is from my local test:

Daemon projection high water detection skipping a gap in event sequence, determined that the 'safe harbor' sequence is at 973

And the original one from my issue, which is running in AKS:

Daemon projection high water detection skipping a gap in event sequence, determined that the 'safe harbor' sequence is at null

As you can see, the difference is that the 'safe harbor' sequence is null. I would've expected it to be detected as 30051, taking the example from above. Do you know what would cause it to be detected as null? That seems to be the problem here.

Also, you might look into the "quick append" option for the event store to avoid having so many gaps in the first place. That might also help.

We are using quick append already. The sequence gap seems to be caused solely by the maintenance (is this a thing?) and not by failed transactions, because there were no attempted event writes around this time.

@jeremydmiller (Member)

"The sequence gap seems to be caused solely by the maintenance (is this a thing?)" -- yeah, totally see that being an issue if you shut down things hard w/o doing a graceful stop

We've had reports of "determined that the 'safe harbor' sequence is at null" before. I'll pay attention to that here. To be clear, you were able to stop a database, restart it, and see Marten recover?

@I4QvViGzUN0F (Author)

Yes, I was able to do that locally and the async daemon picked up from the sequence id after the gap once postgres was back up. I only tried to reproduce it once, and it worked as expected, continuing to process subscriptions after the gap. I don't know if the failure has some specific reproduction steps, maybe specific to Azure, but I'm still trying to reproduce it in a controlled way.
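
For reference, the local reproduction attempt was roughly the following. This is only a sketch: it assumes postgres runs in a local Docker container and uses the sequence name from the ALTER SEQUENCE command mentioned earlier.

```sql
-- 1. Stop the postgres container for a minute or two while the async daemon is running,
--    then start it again, to mimic the maintenance window (docker stop / docker start).
-- 2. Widen the sequence increment so the next allocation leaves a gap in seq_id,
--    then put it back to normal:
ALTER SEQUENCE quality.mt_events_sequence INCREMENT BY 10;
-- append at least one event through the application so the gap shows up in mt_events
ALTER SEQUENCE quality.mt_events_sequence INCREMENT BY 1;
-- 3. Watch the daemon logs for the 'safe harbor' message and confirm that the high water
--    mark in mt_event_progression moves past the gap.
```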
