Add hard abort on resource closure if any part of the stream remains open #853
base: series/0.9
Conversation
closedDef.tryGet.flatMap {
  case Some(_) => F.unit
  case None =>
    if (step < 10) F.sleep(100.millis) *> awaitClose(step + 1) else F.unit
I wonder what a sensible amount of time to wait here is? Should this be configurable? Overload apply with a duration parameter to retain backwards compat? In practice this alleviates the need for actually using abort quite a bit, which is nice.
I'm a bit confused by this: is this equivalent to closedDef.get.timeout(1.second)?
That's what I was looking for. For some reason it wasn't in my code completion 🤦. Thanks!
Ah yeah, you probably just needed import cats.effect.syntax.all._ to get that :)
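For reference, a minimal sketch of that timeout-based wait, assuming closedDef is a Deferred[F, Unit] completed when the socket reports closure (the one-second bound is illustrative, not a value from this PR):

import scala.concurrent.duration._
import cats.syntax.all._
import cats.effect.{Deferred, Temporal}
import cats.effect.syntax.all._ // provides .timeout on F[A]

// Wait for the close signal, but only for a bounded time instead of polling.
// A timeout is treated as "still open" rather than as an error.
def awaitClose[F[_]: Temporal](closedDef: Deferred[F, Unit]): F[Unit] =
  closedDef.get.timeout(1.second).attempt.void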
Sorry, another follow-up question to this: do we have to do this timeout? I am wondering if the JDK WebSocket implementation itself has an internal timeout and we are just forgetting to listen for an error event or something. See the docs:
Unless the CompletableFuture returned from this method completes with IllegalArgumentException, or the method throws NullPointerException, the output will be closed.
If not already closed, the input remains open until a Close message received, or abort is invoked, or an error occurs.
Turns out that in the presence of proxies, some race conditions cause the proxy tunnel to linger even if both flags indicate complete closure of the WebSocket.
case WSFrame.Binary(data, last) => webSocket.sendBinary(data.toByteBuffer, last)
case WSFrame.Ping(data) => webSocket.sendPing(data.toByteBuffer)
case WSFrame.Pong(data) => webSocket.sendPong(data.toByteBuffer)
case WSFrame.Close(statusCode, reason) => webSocket.sendClose(statusCode, reason)
One thing I have seen is this line failing because the output was already closed. We don't send close frames manually, so the only way this should be invoked is by connectHighLevel, which mirrors the close frame. Maybe we should guard this with an if, checking whether the output is still open? I don't like that much though, because low-level usage would then have implicit restrictions.
Also, as a side note, the mirroring of the close frame in connectHighLevel may actually fail if the server emits a close code that is not applicable for a client to send.
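A minimal sketch of the guard being discussed, assuming CE3 and the java.net.http.WebSocket API; sendCloseIfOpen is a hypothetical helper name, and statusCode/reason are taken from the mirrored close frame:

import cats.effect.Async
import cats.syntax.all._
import java.net.http.WebSocket

// Mirror the close frame only while the output side is still open; otherwise
// sendClose would fail because the output has already been closed.
def sendCloseIfOpen[F[_]](webSocket: WebSocket, statusCode: Int, reason: String)(implicit
    F: Async[F]
): F[Unit] =
  F.delay(webSocket.isOutputClosed()).ifM(
    ifTrue = F.unit,
    ifFalse = F.fromCompletableFuture(F.delay(webSocket.sendClose(statusCode, reason))).void
  )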
@GrafBlutwurst Hey, sorry for the ping, but the problem this PR tackles came up again in #1015 (where I added a fix that is a simplified version of this PR). Are you still planning on getting this change over the finish line? And if so, can I help somehow?
I tried merging series/0.9 into this PR, but too many things have changed here.
Sadly I won't be able to get back to this PR in the foreseeable future.
The background of this PR is that we observed leaked sockets that eventually caused our application to crash due to exhausted file handles. Granted, we use websockets a bit unusually, keeping them open only for a short time rather than long-lived, which probably caused this problem to show up in the first place. For more background information see: https://discord.com/channels/632277896739946517/632286375311573032/1091349652579827783
Sadly I was unable to minimize an example due to time constraints, but some Wireshark debugging showed the following.
This is a healthy packet exchange where all sockets in both applications (server and client) are torn down and cleaned up properly:
This is an unhealthy one:
Note that in both cases, on the websocket protocol layer, close frames are exchanged properly. However, in some circumstances (yet unidentified; I suspect concurrency, as the only way I was able to reproduce this is by concurrently firing many requests) the server does not respond with a TCP FIN/ACK.
With a blaze server this also leads to CLOSE_WAIT sockets on the server; ember seems to handle this properly as of 0.29-RC3 with the EoS fix.
With both servers, the JDK client then remains in a half-open state, with the server-side stream remaining open. This PR checks, during resource teardown, whether any side remains open and in that case aborts the connection. It's a bit of a sledgehammer solution, but I couldn't figure out why the TCP frames are missing, and this is at least a band-aid.
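For illustration, a rough sketch of that teardown check, assuming CE3 and the java.net.http.WebSocket API; connect stands in for however the connection is actually acquired and is hypothetical, not this PR's exact code:

import cats.effect.{Async, Resource}
import cats.syntax.all._
import java.net.http.WebSocket

// On resource release, hard-abort the JDK WebSocket if either half is still open,
// so a missing FIN/ACK from the peer cannot leave the socket (or a proxy tunnel)
// dangling until the process runs out of file handles.
def wsResource[F[_]](connect: F[WebSocket])(implicit F: Async[F]): Resource[F, WebSocket] =
  Resource.make(connect) { ws =>
    F.delay(!ws.isInputClosed() || !ws.isOutputClosed()).ifM(
      ifTrue = F.delay(ws.abort()), // abort() immediately closes both input and output
      ifFalse = F.unit
    )
  }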
The problem gets much worse if proxies are involved, as the half-open state of the socket also causes the tunneling connection to remain open, leading to an explosion of allocated sockets that are only eventually, and partially, cleaned up by the JDK's internal HTTPS connection pool handling.
This is possibly related to http4s/http4s#4798 though I'm entirely unsure.