
Remove handler.join call from closed gevent WebSocket tidyup #16

Open · wants to merge 3 commits into master
Conversation

matthewowen (Contributor)

No description provided.

@matthewowen (Contributor, Author)

It's possible that there's more to do here, but I'm unclear on the overall picture. Concretely, I've found that, as currently implemented in master, over time all greenlets become tied up with closed connections. That is: if I run uWSGI with two greenlets, open a WebSocket, close it, open another, and close it, I can't open a third: the async queue is full.

I've identified this line as where things get stuck: the handler.join call simply never returns. As a consequence, we never reach the following return '', which means that the greenlet servicing the WebSocket is never released, and so over time all of one's greenlets become tied up with dead connections.

Now, I'm unclear precisely what the join is doing: even with the most trivial test case I can devise, it simply never returns, so I'm not sure what the consequences of removing it would be. I assume the idea is to ensure that the handler greenlet doesn't float about like a little green zombie? Perhaps an alternative would be to call handler.kill?

In any case, I would love some guidance on what ought to be done so that this line can ultimately be removed: I'm pretty certain that the problem I'm running into is a general one.

@matthewowen (Contributor, Author)

OK, I looked at this some more, thought about it some more, and now understand the problem: concretely, unless you manually do

if not connection.connected:
    break

in the while True loop in your handler, join will never return, because, well, you'll never break out of the loop.
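
For illustration, here's a minimal handler with that check in place (a sketch assuming the connection API used elsewhere in this thread, i.e. connection.connected, connection.receive and connection.send; the echo handler itself is hypothetical):

# Sketch: a handler whose loop actually terminates once the
# connection closes, so that handler.join() can return.
def echo(connection):
    while True:
        if not connection.connected:  # without this check, the loop
            break                     # (and hence the join) never ends
        msg = connection.receive()
        if msg is not None:
            connection.send(msg)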

Now, handler.kill is kinda gross, because it means that if someone has some sort of cleanup code that they want executed after breaking out of the while loop, that code won't run. But it seems like it's always going to be problematic to guarantee that any such code is executed, so I'm not sure that should be a concern. We could set a timeout on the join call, but even if you were manually checking connection.connected and breaking out to hit the cleanup code, it would still be possible to miss the cleanup section, so there's no guarantee there either.

So I think handler.kill would be the best approach. If, for reasons I'm not aware of, you'd prefer not to change this line, then it would be a very good idea to add something to the readme that makes it clear that you need to continually check the status of the connection, lest your greenlets be used up on closed connections...
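
To make the kill-based alternative concrete, here's a sketch using only gevent's public Greenlet API (stuck_handler and the one-second timeout are stand-ins, not this library's actual code):

import gevent

def stuck_handler():
    gevent.sleep(3600)  # stands in for a handler blocked in receive()

handler = gevent.spawn(stuck_handler)

# Bound the join so any cleanup code gets a chance to run, then kill
# the greenlet if it still hasn't finished.
handler.join(timeout=1.0)     # 1.0 stands in for client.timeout
if not handler.ready():
    handler.kill(block=True)  # forcibly release the greenlet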

@zeekay (Owner) commented Jan 24, 2015

I'm still not sure I understand the consequences of removing this, and am considering alternative solutions to allow for more robust handling. I kind of like manually checking if not connection.connected, as it's very explicit and gives you a lot of flexibility in how to handle that situation.

@zeekay (Owner) commented Jan 24, 2015

What I do in most of the examples is wait for None to get added to the queue, which I use as a sentinel.

while True:
    msg = ws.receive()
    if msg is None:  # sentinel: the connection has closed
        break

I'm surprised Greenlet.join just hangs forever; it should only block for as long as client.timeout.

At any rate I need to play around with things a bit, thanks for bringing this to my attention and documenting your problem :)

@zeekay (Owner) commented Jan 24, 2015

This isn't the easiest project to test, sadly, but if you were interested in setting up a test case for your problem, that would be amazingly helpful.

@matthewowen (Contributor, Author)

Sure, I'll have a think about it. FWIW, I also discovered that checking connection.connected in the application-level code is not entirely sufficient, because the following sequence of events can happen:

  1. Within the handler greenlet, we call msg = connection.receive(). This blocks execution of the handler greenlet until there's a message in the queue, and yields to allow other greenlets to make progress.
  2. At this point, execution resumes at ready = wait([handler, send_event, recv_event], None, 1).
  3. We attempt to send a message and get an IOError, so we set connection.connected to False. Since the handler greenlet yielded because it was waiting for a message, and we were only trying to send one, it is still blocked: there isn't a message for it to get from the queue.
  4. We go back to the top of the while True loop, notice that the connection isn't connected, and join the handler greenlet. But that handler greenlet is stuck waiting for a receive event that will never come, because we've killed the listener thread (and even if we hadn't, it's never going to receive a message, since the connection is closed). And so we never return a value :(.

I think this is a correct summation, but I could be missing something...
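
The shape of that deadlock can be reproduced with plain gevent primitives (illustrative only; nothing here is from this library):

import gevent
from gevent.queue import Queue

recv_queue = Queue()

def handler():
    recv_queue.get()  # step 1: blocks until a message arrives

h = gevent.spawn(handler)
gevent.sleep(0)  # step 2: let the handler run until it blocks

# Steps 3-4: the connection is dead and nothing will ever feed
# recv_queue, so an unbounded join would hang here forever.
h.join(timeout=1)
print('handler finished:', h.ready())  # False: still stuck in get()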

Based on this, it's possible that there's a way to judiciously use yield in _gevent.py to avoid this... but I'm not 100% sure right now: there might be additional, similar edge cases that get us into trouble that I haven't thought of yet.

More generally, even if there's a solution that prevents this happening when calling methods defined in this library, it will still be possible for users to write code that blocks indefinitely. Of course, one might reasonably say that users ought not to do so, but I suspect that many users might not fully grasp the implications of potentially unbounded blocking inside a while True loop: they might (perhaps irrationally) expect the loop to be forcibly interrupted in the event of a broken connection.
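
One way a user could defend against that today is to bound each blocking call themselves, e.g. with gevent.Timeout (a sketch, not part of this library's API; the five-second bound and the handler shape are arbitrary):

import gevent

def handler(connection):
    while connection.connected:
        try:
            with gevent.Timeout(5):  # bound the blocking receive
                msg = connection.receive()
        except gevent.Timeout:
            continue  # re-check connection.connected and try again
        if msg is None:
            break  # sentinel: connection closed
        connection.send(msg)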

@matthewowen (Contributor, Author)

All that being said, you're right that the timeout ought to mitigate this, so maybe I'm going crazy and it's something else. And of course, you do put a None on the receive queue, so I suppose my summary above can't be quite right, either...

@zeekay (Owner) commented Jan 27, 2015

All the more reason we need a test suite to verify anything works as expected :)
