ReactorNettyClient gets stuck on a cancelled Conversation if that Conversation has more than 256 rows (size of reactor.bufferSize.small) #661

alexeykurshakov · 2024-09-25T04:47:06Z

Bug Report

Versions

Driver: 1.0.5
Database: PostgreSQL 13.12
Java: 17
OS: MacOS, Linux

Current Behavior

When a query is zipped in parallel with another function that fails, and that query returns more than 256 rows, the chain is cancelled and no real consumer remains, yet data keeps arriving from the database and is saved into the ReactorNettyClient buffer.
When this happens, any other attempt to get data from the database fails, because ReactorNettyClient.BackendMessageSubscriber.tryDrainLoop never calls drainLoop: the stuck conversation has no demand.

private void tryDrainLoop() {
    while (hasBufferedItems() && hasDownstreamDemand()) {
        if (!drainLoop()) {
            return;
        }
    }
}
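To make the stall easier to see, here is a minimal sketch of a demand-gated drain loop. It is a simplified model, not the driver's code: the class `DrainLoopModel`, its `Conversation` type, and the assumption that only the head conversation's demand is consulted are all hypothetical, and a cancelled conversation is modelled simply as one with zero demand.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

/** Simplified, hypothetical model of a demand-gated drain loop (not the driver's code). */
final class DrainLoopModel {

    /** One request/response exchange; demand is what its subscriber has requested. */
    static final class Conversation {
        long demand;
        int received;

        Conversation(long demand) {
            this.demand = demand;
        }
    }

    private final Queue<String> buffer = new ArrayDeque<>();
    private final Queue<Conversation> conversations = new ArrayDeque<>();

    void register(Conversation conversation) {
        conversations.add(conversation);
    }

    /** Called for every message read from the socket. */
    void onBackendMessage(String message) {
        buffer.add(message);
        tryDrainLoop();
    }

    private void tryDrainLoop() {
        // Assumption of this model: only the head conversation's demand is consulted.
        while (!buffer.isEmpty() && headHasDemand()) {
            Conversation head = conversations.peek();
            String message = buffer.poll();
            head.demand--;
            head.received++;
            if (message.equals("ReadyForQuery")) {
                conversations.poll(); // conversation finished, the next one becomes head
            }
        }
    }

    private boolean headHasDemand() {
        Conversation head = conversations.peek();
        return head != null && head.demand > 0;
    }

    /** Returns {messages left in buffer, received by first, received by second}. */
    static int[] simulate(long firstDemand) {
        DrainLoopModel client = new DrainLoopModel();
        Conversation first = new Conversation(firstDemand);
        client.register(first);
        for (String m : new String[] {"DataRow", "DataRow", "DataRow", "ReadyForQuery"}) {
            client.onBackendMessage(m);
        }
        Conversation second = new Conversation(10);
        client.register(second);
        for (String m : new String[] {"DataRow", "ReadyForQuery"}) {
            client.onBackendMessage(m);
        }
        return new int[] {client.buffer.size(), first.received, second.received};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(0)));   // head has no demand: [6, 0, 0]
        System.out.println(Arrays.toString(simulate(100))); // head keeps consuming: [0, 4, 2]
    }
}
```

In the zero-demand run, the second conversation has plenty of demand but never receives anything, because its messages sit in the buffer behind the ones belonging to the stuck head.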

This can be reproduced using https://github.com/agorbachenko/r2dbc-connection-leak-demo
If you increase the system property "reactor.bufferSize.small" to 350, the attached example starts working.
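For anyone who wants to try the same workaround from code rather than with `-Dreactor.bufferSize.small=350` on the command line, a minimal sketch follows. The class name and the `configuredSize` helper are hypothetical; the helper only reads the property back for illustration and is not Reactor's own parsing. To be safe, the property should be set before any Reactor class is loaded, since values like this are usually read once. Note that this only raises the threshold at which the stall appears, it does not remove the underlying problem.

```java
/** Hypothetical sketch: setting the buffer size property at application start. */
final class SmallBufferSizeWorkaround {

    static final String PROPERTY = "reactor.bufferSize.small";

    /** Illustration only: reads the property back, falling back to 256 (the default named in the issue title). */
    static int configuredSize() {
        return Integer.parseInt(System.getProperty(PROPERTY, "256"));
    }

    public static void main(String[] args) {
        // Set this first, before any Reactor or driver class is touched.
        System.setProperty(PROPERTY, "350");
        System.out.println(PROPERTY + " = " + configuredSize());
        // ... start the application here ...
    }
}
```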


mp911de · 2024-09-25T07:19:37Z

Thanks a lot for chasing this issue down. Since you invested about 80% of the effort that is required to fix the issue, do you want to submit a pull request to clear out the cancelled conversations?

alexeykurshakov · 2024-09-25T10:14:36Z

I've never worked with the Reactor library (Mono, Flux) before, and I found that it's not easy to track down the source of a cancellation: an error in a parallel zip function, an ordinary cancel, or a cancellation from Mono.from(fluxPublisher).
For example

Mono.from(Flux.just(1, 2, 3).doOnCancel(() -> {
    System.out.println("fire");
})).subscribe();

prints "fire" as soon as the first item is emitted,
and

Flux.just(1, 2, 3).doOnCancel(() -> {
    System.out.println("fire");
}).subscribe();

never prints "fire".
If you can help me track down the type of cancellation, sure, I can make a pull request.

chemicL · 2024-09-25T12:30:46Z

@alexeykurshakov these cancellations have reasonable explanations. A couple examples:

Mono.from(Publisher).subscribe() cancels the Publisher once the first item is emitted, as Mono expects at most one item to be emitted to the Subscriber.
Flux.just(T...).subscribe() has no reason to cancel at all, as multiple items adhere to the Flux specification.
Flux.zip(Publisher, Publisher).subscribe() will cancel the other Publisher once one of them completes/errors.
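The first two cases can be modelled without Reactor, using only the JDK's `java.util.concurrent.Flow` interfaces. This is a sketch under assumptions, not Reactor's implementation: `ListPublisher` and `DemoSubscriber` are hypothetical names, and the "first item only" subscriber merely imitates what an at-most-one-item adapter has to do.

```java
import java.util.List;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicBoolean;

/** Plain-JDK model of the two subscribe() cases; it does not use Reactor itself. */
final class CancelOnFirstItemDemo {

    /** Synchronous publisher of fixed items that records whether it was cancelled. */
    static final class ListPublisher implements Flow.Publisher<Integer> {
        private final List<Integer> items;
        final AtomicBoolean cancelled = new AtomicBoolean();

        ListPublisher(List<Integer> items) {
            this.items = items;
        }

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
            subscriber.onSubscribe(new Flow.Subscription() {
                private int index;
                private boolean completed;

                @Override
                public void request(long n) {
                    long emitted = 0;
                    while (emitted < n && index < items.size() && !cancelled.get()) {
                        subscriber.onNext(items.get(index++));
                        emitted++;
                    }
                    if (index == items.size() && !cancelled.get() && !completed) {
                        completed = true;
                        subscriber.onComplete();
                    }
                }

                @Override
                public void cancel() {
                    cancelled.set(true); // the point where a doOnCancel hook would run
                }
            });
        }
    }

    /** Subscriber that optionally cancels upstream after the first item. */
    static final class DemoSubscriber implements Flow.Subscriber<Integer> {
        private final boolean firstItemOnly;
        private Flow.Subscription subscription;

        DemoSubscriber(boolean firstItemOnly) {
            this.firstItemOnly = firstItemOnly;
        }

        @Override
        public void onSubscribe(Flow.Subscription subscription) {
            this.subscription = subscription;
            subscription.request(Long.MAX_VALUE);
        }

        @Override
        public void onNext(Integer item) {
            if (firstItemOnly) {
                subscription.cancel(); // one item is enough, so the rest is not wanted
            }
        }

        @Override
        public void onError(Throwable throwable) {
        }

        @Override
        public void onComplete() {
        }
    }

    /** Returns true if the source saw a cancel signal. */
    static boolean upstreamCancelled(boolean firstItemOnly) {
        ListPublisher source = new ListPublisher(List.of(1, 2, 3));
        source.subscribe(new DemoSubscriber(firstItemOnly));
        return source.cancelled.get();
    }

    public static void main(String[] args) {
        System.out.println("first item only: cancelled = " + upstreamCancelled(true));
        System.out.println("all items:       cancelled = " + upstreamCancelled(false));
    }
}
```

The single-item subscriber triggers `cancel()` on the source, while the subscriber that takes every item lets the source complete normally and never cancels.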

For inspiration regarding test cases, perhaps you can use my examples with mocks. This was part of the investigation whether the r2dbc-pool is responsible for the connection leaks in r2dbc/r2dbc-pool#198 (comment).

alexeykurshakov · 2024-09-26T12:59:15Z

@mp911de

r2dbc-postgresql/src/main/java/io/r2dbc/postgresql/PostgresqlStatement.java

Line 257 in a13c02c

.as(source -> Operators.discardOnCancel(source, () -> {

If you are in SimpleQueryMessageFlow.exchange, the original cancellation is just ignored. I don't understand what the correct behaviour is.
Why do you discard the cancellation with Operators.discardOnCancel, and what should .doOnDiscard(ReferenceCounted.class, ReferenceCountUtil::release) do?

mp911de · 2024-09-26T14:01:03Z

Operators.discardOnCancel is to drain protocol frames off the transport so that we can finalize the conversation with the server. If we just cancelled the consumption, then response frames from an earlier conversation would remain on the transport and feed into the next conversation.
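The idea of draining instead of stopping can be sketched with a plain queue standing in for the transport. This is an illustrative model only: `DiscardOnCancelModel`, `readConversation`, and the frame names other than ReadyForQuery are hypothetical and do not reflect the driver's actual classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical model: why cancelled conversations are drained rather than abandoned. */
final class DiscardOnCancelModel {

    static final String TERMINATOR = "ReadyForQuery";

    /**
     * Reads one conversation from the transport. The consumer loses interest after
     * cancelAfter frames. With drainOnCancel, the remaining frames up to and including
     * the terminator are still read and dropped; without it, they stay on the transport.
     */
    static List<String> readConversation(Queue<String> transport, int cancelAfter, boolean drainOnCancel) {
        List<String> delivered = new ArrayList<>();
        String frame;
        while ((frame = transport.peek()) != null) {
            boolean cancelled = delivered.size() >= cancelAfter;
            if (cancelled && !drainOnCancel) {
                return delivered; // plain cancel: leftover frames remain on the transport
            }
            transport.poll();
            if (!cancelled) {
                delivered.add(frame);
            }
            if (frame.equals(TERMINATOR)) {
                return delivered; // conversation finalized
            }
        }
        return delivered;
    }

    /** Runs a cancelled first conversation, then returns what the second one receives. */
    static List<String> secondConversationSees(boolean drainOnCancel) {
        Queue<String> transport = new ArrayDeque<>(
                List.of("A1", "A2", "A3", TERMINATOR, "B1", TERMINATOR));
        readConversation(transport, 1, drainOnCancel);                   // cancelled after A1
        return readConversation(transport, Integer.MAX_VALUE, drainOnCancel);
    }

    public static void main(String[] args) {
        System.out.println(secondConversationSees(false)); // [A2, A3, ReadyForQuery]
        System.out.println(secondConversationSees(true));  // [B1, ReadyForQuery]
    }
}
```

Without draining, the second conversation is fed the first conversation's leftover frames; with draining, it starts cleanly at its own first frame.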

alexeykurshakov · 2024-09-26T14:31:36Z

Sounds like it should work, but it doesn't 🤣. In the example from the issue, badThread never consumes data, and the cancel signal is sent after real data has already been fed into ReactorNettyClient, which leads to those messages being saved in its internal buffer. So in that example the discard happens too late.

alexeykurshakov · 2024-09-26T14:33:08Z

I can provide a timeline of what happened. And then we'll figure out how to fix it.

travispeloton · 2024-10-29T20:06:46Z

Hello! We've been hit by a similar issue this past week during some load testing. I have attached a stacktrace. We also saw a few Netty LEAK errors (stacktrace).

Spring Webflux and Kotlin, so we're using coroutines to await responses
We use kotlinx.coroutines.withTimeout around these queries
It usually only happens on one server instance
Lasts 15-30 seconds
Server recovers
Simple queries against the primary key (e.g., SELECT * FROM X WHERE ID = ?)
We're using a connection pool

agorbachenko · 2024-11-27T06:04:01Z

@alexeykurshakov Have you managed to conduct further investigation?

alexeykurshakov · 2024-11-28T10:38:05Z

@travispeloton Could you provide more details about your testing environment? I don't clearly understand what you mean by "It usually only happens on one server instance".
Also, I assume more than a single query is running in your test suite, because the queue limit could be exceeded in other places. Am I right?
For further investigation, it would be nice if you provided an example like agorbachenko did in https://github.com/agorbachenko/r2dbc-connection-leak-demo

alexeykurshakov · 2024-11-28T10:48:56Z

@agorbachenko Unfortunately no. We put a temporary workaround in the project and are moving to jOOQ instead.

travispeloton · 2024-11-28T14:22:07Z

@alexeykurshakov we haven't seen the issue again

For "It usually only happens on one server instance", we run multiple k8s pods, so it was observed on a single pod in the 3 different times we saw it.

"not only a single query" - in my case there is one type of query that currently dominates traffic

alexeykurshakov added the "status: waiting-for-triage" (An issue we've not yet triaged) label Sep 25, 2024

chemicL mentioned this issue Sep 25, 2024

Connection leaks (not released) when concurrent query is canceled r2dbc/r2dbc-pool#198

Open

This was referenced Nov 24, 2024

connection leak when using r2dbc-pool with r2dbc-mysql asyncer-io/r2dbc-mysql#294

Closed

Connection leak when using r2dbc-pool & spring-tx r2dbc/r2dbc-pool#219

Open
