Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hard abort on resource closure if any part of the stream remains open #853

Open
wants to merge 4 commits into
base: series/0.9
Choose a base branch
from

Conversation

GrafBlutwurst
Copy link

@GrafBlutwurst GrafBlutwurst commented Apr 19, 2023

Background of this MR is that we observed leaked sockets that eventually caused our application to crash due to exhausted file handles. Granted we use websockets a bit weirdly by having them only open a short time rather than long which probably caused this problem to show up in the first place. For more background information see: https://discord.com/channels/632277896739946517/632286375311573032/1091349652579827783

Sadly I was unable to minimize an example due to time constraint but some wireshark debugging showed the following.

This is a healthy package exchange where all sockets in both applications (server and client) are torn down and cleaned up properly:

image

This is an unhealthy one:

image

Note that in both cases on the websocket protocol layer, close frames are exchanged properly. However in some (yet unidentified, I suspect concurrency, as the only way i was able to reproduce this is by concurrently firing many requests) cirumstances the server does not respond with a TCP FIN/ACK.

With a blaze server this also leads to CLOSE_WAIT sockets on the server, ember seems to handles this properly as of 0.29-RC3 with the EoS fix.

With both servers the JDK client then remains in a half open state with the server side stream remaining open. This MR checks if any side remains open in the tear down of the resource and in such a case aborts the connection. It's a bit of a sledgehammer solution but i couldn't figure out why the TCP frames are missing and this is at least a bandaid.

The problem gets much worse if proxies are involved as the half-open state of the socket also causes the tunneling connection to remain open leading to an explosion in allocated socket that are only eventually somewhat cleaned up by JDK https internal pool handling.

This is possibly related to http4s/http4s#4798 though I'm entirely unsure.

@mergify mergify bot added the core label Apr 19, 2023
closedDef.tryGet.flatMap {
case Some(_) => F.unit
case None =>
if (step < 10) F.sleep(100.millis) *> awaitClose(step + 1) else F.unit
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what a sensible amount of time to wait here is? should this be configurable? overloaded apply with a duration parameter to retain backwards compat? In practice this alleviates the need for actually using abort quite a bit which is nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this, is this equivalent to closedDef.get.timeout(1.second) ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I was looking for. For some reason it wasn't in my code completion 🤦 . thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, you probably just needed import cats.effect.syntax.all._ to get that :)

Sorry, another follow-up question to this: do we have to do this timeout? I am wondering if the JDK WebSocket implementation itself has an internal timeout, and we are just forgetting to listen for an error event or something. see the docs:

Unless the CompletableFuture returned from this method completes with IllegalArgumentException, or the method throws NullPointerException, the output will be closed.

If not already closed, the input remains open until a Close message received, or abort is invoked, or an error occurs.

https://docs.oracle.com/en/java/javase/17/docs/api/java.net.http/java/net/http/WebSocket.html#sendClose(int,java.lang.String)

Turns out in the presence of proxies some race conditions cause the
proxy tunnel to linger even if both flags indicated complete closure of Websocket
case WSFrame.Binary(data, last) => webSocket.sendBinary(data.toByteBuffer, last)
case WSFrame.Ping(data) => webSocket.sendPing(data.toByteBuffer)
case WSFrame.Pong(data) => webSocket.sendPong(data.toByteBuffer)
case WSFrame.Close(statusCode, reason) => webSocket.sendClose(statusCode, reason)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I have seen is this line failing because the output was already closed. We don't send close frames manually so the only way this should be invoked is by connectHighLevel which mirrors the close frame. Maybe we should guard this with an if, checking if the output is still open? I don't like that much though because low level usage would then have implicit restrictions.

Also as a side note, the mirroring of the closure frame in connectHighLevel actually may fail if the server emits a close code that is not applicable for a client to send.

@ybasket
Copy link
Contributor

ybasket commented Feb 13, 2024

@GrafBlutwurst Hey, sorry for the ping, but the problem this PR tackles came up again in #1015 (where I added a fix that is a simplified version of this PR). Are still planning on getting this change over the finish line? And if so, can I help somehow?

@hamnis
Copy link
Contributor

hamnis commented Nov 7, 2024

I tried merging series/0.9 into this PR, but too many things have changed here.
Are you able to pick this up again @GrafBlutwurst ?

@GrafBlutwurst
Copy link
Author

sadly i won't be able to get back to this pr in the forseeable future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants