
RATIS-2418. NettyClient LoggingHandler prints too many messages. #1359

Merged

szetszwo merged 3 commits into apache:master from slfan1989:RATIS-2418 on Feb 26, 2026


Conversation

@slfan1989 (Contributor)

What changes were proposed in this pull request?

This PR reduces test log noise by adjusting logging levels and making logging more selective:

  • NettyRpcProxy.java: Changed the LoggingHandler from LogLevel.WARN to LogLevel.DEBUG with an explanatory comment, since it emits all Netty pipeline events at the chosen level.
  • TestRaftAsyncWithNetty.java: Overrides the base class's DEBUG settings by raising the RaftServer.Division and RaftClient log levels back to INFO to reduce verbose test output.
  • Modified the testWithLoad Timer to print cluster/client state only when no progress is detected (the log statements were moved into the "no progress" branch)
    • Before changes:
      • Timer prints unconditionally every 10 seconds (starting after 5 seconds)
      • Log frequency = 1 time / 10 seconds = 6 times / minute
    • After changes:
      • Only prints when no progress is detected
      • No progress definition: lastStep remains unchanged between consecutive Timer checks (10-second interval)
      • lastStep update condition: cumulative progress delta ≥ 50 × numClients
      • Trigger condition: cumulative progress delta < 50 × numClients within 10 seconds
  • Converted all string concatenation logging to placeholder style ({}) for better performance and consistency
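The no-progress detection described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual test code; the class and method names are made up:

```java
// Hypothetical sketch of the "no progress" detection described above;
// names are illustrative, not the actual TestRaftAsyncWithNetty code.
class ProgressMonitor {
  private final int numClients;
  private long lastStep = 0;      // advanced only when enough progress accumulates
  private long prevObserved = -1; // value of lastStep at the previous timer tick

  ProgressMonitor(int numClients) {
    this.numClients = numClients;
  }

  /** Called with the cumulative step count across all clients. */
  void update(long totalSteps) {
    if (totalSteps - lastStep >= 50L * numClients) {
      lastStep = totalSteps; // enough progress: advance the checkpoint
    }
  }

  /** Called by the timer every 10 seconds; true means "print cluster/client state". */
  boolean shouldLog() {
    final boolean noProgress = lastStep == prevObserved;
    prevObserved = lastStep;
    return noProgress;
  }
}
```

The logging fires only when lastStep stays unchanged across two consecutive 10-second ticks, i.e. when fewer than 50 × numClients steps accumulated in that interval.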

What is the link to the Apache JIRA

JIRA: RATIS-2418. NettyClient LoggingHandler prints too many messages.

How was this patch tested?

Existing unit tests.

@slfan1989 (Contributor, Author)

@szetszwo Could you please take a look and share your feedback when you have a moment? Thanks a lot!

@szetszwo (Contributor)

TestRaftAsyncWithNetty.testWithLoad(..) failed with NPE:

Caused by: java.lang.NullPointerException: Request #3851 not found
	at java.base/java.util.Objects.requireNonNull(Objects.java:336)
	at org.apache.ratis.netty.NettyRpcProxy$Connection.getReplyFuture(NettyRpcProxy.java:203)
	at org.apache.ratis.netty.NettyRpcProxy$Connection.access$100(NettyRpcProxy.java:149)
	at org.apache.ratis.netty.NettyRpcProxy$Connection$1.channelRead0(NettyRpcProxy.java:159)
	at org.apache.ratis.netty.NettyRpcProxy$Connection$1.channelRead0(NettyRpcProxy.java:155)
	at org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	... 25 more

@slfan1989 (Contributor, Author)

> TestRaftAsyncWithNetty.testWithLoad(..) failed with NPE: … Request #3851 not found (stack trace quoted above)

@szetszwo Thanks for your message. I'll follow up on this.

@szetszwo (Contributor)

When the callId is not found in replies, we probably should just ignore it rather than failing.

+++ b/ratis-netty/src/main/java/org/apache/ratis/netty/NettyRpcProxy.java
@@ -157,6 +157,9 @@ public class NettyRpcProxy implements Closeable {
         protected void channelRead0(ChannelHandlerContext ctx,
                                     RaftNettyServerReplyProto proto) {
           final CompletableFuture<RaftNettyServerReplyProto> future = getReplyFuture(getCallId(proto), null);
+          if (future == null) {
+            return;
+          }
           if (proto.getRaftNettyServerReplyCase() == EXCEPTIONREPLY) {
             final Object ioe = ProtoUtils.toObject(proto.getExceptionReply().getException());
             future.completeExceptionally((IOException)ioe);
@@ -199,7 +202,6 @@ public class NettyRpcProxy implements Closeable {
     private CompletableFuture<RaftNettyServerReplyProto> getReplyFuture(long callId,
         CompletableFuture<RaftNettyServerReplyProto> expected) {
       final CompletableFuture<RaftNettyServerReplyProto> removed = replies.remove(callId);
-      Objects.requireNonNull(removed, () -> "Request #" + callId + " not found");
       if (expected != null) {
         Preconditions.assertSame(expected, removed, "removed");
       }

@slfan1989 (Contributor, Author)

@szetszwo I’m still working on pinpointing the root cause of this issue. So far I haven’t been able to reproduce it in my local environment, so I’ll need a bit more time to investigate and validate. I’ll update here as soon as I have progress.

@slfan1989 (Contributor, Author)

> When the callId is not found in replies, we probably should just ignore it but not failing. (suggested NettyRpcProxy.java patch quoted above)

@szetszwo Thank you for helping review the code. I think I've identified the root cause (a race condition and memory leak in NettyRpcProxy), and the unit tests pass after the fix:

Root Cause

The NettyRpcProxy.offer() method is responsible for sending requests and awaiting replies. The original implementation had a race condition:

// Wrong order
client.writeAndFlush(request);     // 1. Send request first
replies.put(callId, reply);        // 2. Put future into map later

// Race window: If the reply arrives extremely fast (e.g., localhost),
// channelRead0() may execute before put(), resulting in:
// - Cannot find the future for callId in the map
// - Throws "Request #X not found" exception
// - May cause unexpected connection closure

Fix Applied

// Correct order
replies.put(callId, reply);              // 1. Put future into map first
try {
  future = client.writeAndFlush(request); // 2. Then send request
} catch (AlreadyClosedException e) {
  replies.remove(callId, reply);         // 3. Rollback on sync failure
  return completeExceptionally(e);
}

// 4. Also rollback on async failure
future.addListener(cf -> {
  if (!cf.isSuccess()) {
    getReplyFuture(callId, reply, "write-failure"); // Remove from map
  }
});
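As a standalone illustration of the ordering fix (not the actual Ratis code), the same pattern can be modeled with a plain ConcurrentHashMap and CompletableFuture. The class name and the send() helper here are hypothetical:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal model of the offer() ordering fix: register the reply future
// BEFORE sending, so even an instantly-arriving reply can find it, and
// roll back the map entry if the send fails. Names are hypothetical.
public class OfferOrdering {
  static final ConcurrentMap<Long, CompletableFuture<String>> replies = new ConcurrentHashMap<>();

  static CompletableFuture<String> offer(long callId, boolean sendFails) {
    final CompletableFuture<String> reply = new CompletableFuture<>();
    replies.put(callId, reply);        // 1. put the future into the map first
    try {
      send(callId, sendFails);         // 2. then send the request
    } catch (RuntimeException e) {
      replies.remove(callId, reply);   // 3. roll back on send failure (no leak)
      reply.completeExceptionally(e);
    }
    return reply;
  }

  // Simulates the channel: on success the "reply" arrives immediately,
  // which is exactly the fast-reply window the original ordering lost.
  static void send(long callId, boolean fail) {
    if (fail) {
      throw new RuntimeException("channel closed");
    }
    final CompletableFuture<String> f = replies.remove(callId);
    f.complete("reply-" + callId); // the future is guaranteed to be present
  }

  public static void main(String[] args) throws Exception {
    System.out.println(offer(1, false).get());                      // reply-1
    System.out.println(offer(2, true).isCompletedExceptionally());  // true
    System.out.println(replies.isEmpty());                          // true: no leak
  }
}
```

With the old write-then-put order, the complete() in send() could run before the map entry existed; put-then-write closes that window, and the rollback keeps a failed send from leaking a map entry.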

@szetszwo (Contributor) left a comment:
@slfan1989 , thanks a lot for digging out the bug! You are right that we must put the future first.

+1 the change looks good.

@szetszwo szetszwo merged commit a5651f6 into apache:master Feb 26, 2026
16 checks passed