-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #69 #71
base: master
Are you sure you want to change the base?
Fixes #69 #71
Conversation
I tested this during the week with a project and it seems to have worked well. I'm not sure what the proper review procedure is, but I see @phadej in a lot of other PRs, so I'm citing you here too. Sorry for the unsolicited request; please let me know if there are others I should request review from for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Clearly this improves the situation, but does it fix it properly?
What if another async exception is thrown immediately, interrupting cancelAndClear
? It probably should be masked.
I also miss context. If the original issue was with using pool
, isn't a problem that this invalid connection isn't removed from a pool. I.e. a solution is just to not use that connection anymore?
-- they don't do any of the internal libpq state "cleaning" | ||
-- that is necessary. | ||
(consumeUntilNotBusy h socket >> getResult h Nothing) | ||
`catch` \(e :: SomeException) -> do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks! Should be fixed now.
Good point. I've never used libpq directly before, so my understanding here is very limited. I'm reading https://www.postgresql.org/docs/current/libpq-cancel.html and https://www.postgresql.org/docs/current/libpq-async.html but it's not clear to me whether
I didn't follow the
|
I don't know. If idea is to kill a thread, maybe? |
I think I started out with a bad example. My focus should've been on
Maybe. But would we want to disallow this? It seems perfectly valid to do so. Also, doesn't this very library do this to execute
I don't have much experience with exception handling in such cases. But thinking about it now, it seems if libpq can block or take a bit longer in such cases, that we'll already be susceptible to that behavior with this PR. Given that, we might as well mask the exception handler. What do you think? |
I was looking into masking the exception handler and discovered in https://hackage.haskell.org/package/base-4.14.1.0/docs/Control-Exception.html that "There's an implied mask around every exception handler in a call to one of the catch family of functions.", and it also seems to apply to onException. If interruptible masking is what we want this PR should be good as it is - and it seems like interruptible masking is what this library does elsewhere. But perhaps you meant we should use uninterruptible masking? I must admit my understanding of masking, interruptible operations and asynchrous exceptions is too much on the "short" side to make a decision on this. |
After reading fpco/safe-exceptions#3 (but admittedly not having read the entire conversation too closely) and https://gitlab.haskell.org/ghc/ghc/-/issues/18899, it seems to me that:
So I'm more and more inclined to do uninterruptible masking. |
This PR adds 4 new tests. The first two pass on
Sorry for the long monologue spamming this thread, @phadej, but can I get another review? The tests seem to have failed due to non related intermittent problems, and I can't retry them. |
-- (e.g. the query being canceled or session terminated), | ||
-- libpq will not throw, but will instead return a Result | ||
-- indicating an error | ||
uninterruptibleMask $ \restore -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might get an async exception after sendQuery
but before uninterruptibleMask
. Probably the whole do
block needs mask
around it with restore around sendQuery
. Sorry if I'm overlooking something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point, thanks!
@mzabani The code itself looks great! Though I have concerns regarding the approach. You simply can't guarantee that connection is valid in case of exception. Any network error will leave the connection invalid, and any subsequent query will fail. Also if Also the connection might die after it's returned into a pool, but before it's reused. So application code have to be prepared to handle the case when connection is invalid. If you don't reuse connection after exception, then you don't need to rollback the transaction there. It's enough to close the connection, and database will rollback the transaction. So IMO the real issue is
Cancelling query under So I'd suggest
|
I think I agree with this partially. There are exceptions which do not leave the connection in an invalid state, can be properly detected and still allow users to use the connection after that. One example is
I must admit I'm not experienced with Haskell's exception world, but wouldn't slowness under poor network conditions be a problem with or without uninterruptibleMask? Also, if the connection is dropped while canceling, wouldn't that actually be a synchronous exception, so
This is only true if we accept that users must discard connections when an exception is thrown, though, right? Otherwise we need I think I can summarize the options we are discussing, but let me know if I misunderstand things:
|
Oh, this PR adds a minute to Windows builds due to a new test and the distinct implementation of A whole minute is not necessary for the test; it can be reduced to something like 5 seconds. But the separate code paths date from 2014, so maybe the Windows-specific code path is no longer necessary? |
How user is supposed to know which kind of exception he is dealing with? He needs to decide whether to reuse the connection. It's OK to provide a closed list of exceptions that keep connection valid.
Not sure I understand. In the section you quoted I was talking about
Without
It's just documenting the current behavior. Most users discard connections already anyway.
Not sure I understand why. You need to be careful to make sure you don't reuse connection, but it's possible and not that hard. Also it's not the case that the connection is completely invalid, you just don't know in what state it is right now. So you can run query using such connection, but you have to be prepared to handle failure. But you need to be prepared to that anyway, so no difference. Anyway, I'm not maintainer of this library, I'm just regular user. You don't need to convince me, I'm only expressing my concerns. |
I know you mentioned you're not the maintainer, but you've raised some good points and gave me things to think about, so I'll take the liberty of replying to some of them because I think they're very pertinent. But by no means I want you to feel obliged to reply, although it'd be nice to hear more of your thoughts.
Ah, right, of course. Sorry, I misunderstood completely. I've been thinking about this over the last few days, and I've mentioned that it's still tricky to me (you can read my monologue in this thread). Reviewing my own arguments defending It now feels to me like like anytime cleanup handlers do networking, disk, or anything with a good chance of being delayed by factors beyond our control, that uninterruptible masking is too high a risk. Although that's a different discussion, and one beyond my depth.
But I think the purpose here is precisely that sometimes we want to be able to use the connection after some kinds of errors, and we need canceling in order to do that. Also, some exceptions are not thrown to our thread, so it's easy to miss them completely. Take, for example: timeout (1000000 * 10) $ query conn "SELECT very_long_running_query"
query conn "SELECT some_other_query" I'd argue it's perfectly valid code, but if the query doesn't return in 10 seconds, the second query will fail every time (and it doesn't have to). Even more subtle is just running I guess my point is that although we must be prepared to handle failures at any time, there are some unnecessary ones that can be avoided, one of which this PR fixes. |
@mzabani Looks like we agreed that timeout (1000000*60) $ do
someCode
timeout (1000000*10) $ query conn "SELECT very_long_running_query"
moreCode The following scenario is possible:
As a result the action under the outer Note that only the first exception needs to be asynchronous in this scenario: we'll happily ignore it even if the second one is synchronous. So ideally
That's what I meant by "You need to be careful to make sure you don't reuse connection". I claim that the code is invalid since it reused the connection. You should not use timeout (1000000*10) $ withResource pool $ \conn
-> query conn "SELECT very_long_running_query" If you really know what you are doing, then you either manually timeout (1000000*10) (query conn "SELECT very_long_running_query")
>>= maybe (cancelAndClear conn) return
query conn "SELECT some_other_query" Yes, it makes code less composable, but there is no good way to fix it. |
Well, we somehow ended up discussing some ugly corner cases. But what's the actually issue you want to address? I think every existing library based on |
Quite interesting ones, though. Your last example with
Good point, and it's a problem that I need to address, because it currently doesn't hold.
Personally, I believe connection pool users aren't the only consumers of this library, nor should we expect them to be (I've been working on a project, codd which isn't, for example). But still, using It is a problem I've seen at the shop I work at, and many others appear to have seen it too: lpsmith/postgresql-simple#177, tomjaguarpaw/haskell-opaleye#297 and yesodweb/persistent#1199. It's also very tricky to diagnose, and I'd say is bound to happen now and then (the raised issues kind of confirm this). From reading your last two comments, do I understand correctly that you propose one of the two below?
I'm all for number 2, but it's clear something I'm not currently capable to do. My opinion is that this PR may not fully fix the underlying problem, but it will probably improve the current situation. Because anything we come up with has pros and cons, I propose we view the possible scenarios under a model of likelihood. Suppose postgresql-simple is waiting for query results, and then asynchronous exception(s) is/are thrown. Suppose the cleanup handler does not throw synchronous exceptions, except maybe when the connection is completely closed (still an open question if it can be avoided even in that case, though). One single asynchronous exception thrown (e.g. single
|
Good connection to the DB | Slow connection to the DB | |
---|---|---|
Without this PR | Risks running into issue #69, no other correctness issues otherwise | Risks running into issue #69, no other correctness issues otherwise |
With this PR, interr. masking | < 10ms added runtime*, no errors except maybe making timeout be delayed by this added runtime wrt what was requested |
Same as "good connection", but with longer delays. Other connections have a good chance of being slow too, so not necessarily a big problem per se. If the cleanup handler throws due to the connection being closed, it's very likely there's a bigger and more general connectivity problem "anyway" |
With this PR, uninterr. masking | Same as above | Same as above |
One asynchronous exception thrown, followed by a second one (e.g. user pressing Ctrl+C, nested timeout
s or other more complex async patterns)
Good connection to the DB | Slow connection to the DB | |
---|---|---|
Without this PR | Risks running into issue #69, no other correctness issues otherwise | Risks running into issue #69, no other correctness issues otherwise |
With this PR, interr. masking | Only a problem if the second exception happens within the short interval during which cancelAndClear runs. But then, big and bad possible correctness issues arise, including timeouts not being followed at all |
Pretty bad, high level of certainty of bad correctness issues arising. |
With this PR, uninterr. masking | Same as "good connection" for one single exception | User's Ctrl+C or thread killing will have to wait for a possibly long time. If things take really long, this becomes a "zombie" thread (but then there might be connectivity issues elsewhere too) and the entire application must then be killed or the user must wait as long as necessary. |
- Over localhost with no DB overhead it's more like 1.2ms, so 10ms is completely guessed and could be very wrong. I haven't really measured it properly in more realistic scenarios.
You've mentioned that we're discussing corner cases, and I agree. May I add to that that slow connections are also unlikely themselves? Most postgres users will be connecting over LAN, which should work pretty well. To add to that, I'd guess in many cases where one connection is slow, it shouldn't be unbearably slow, and even then, it's very likely other connections are also slow (the server might be overloaded, but only one single connection being slow? Sounds unlikely), and the application will already be suffering from a general slowdown.
But given that is all recently learned for me, can I ask you to correct my table or propose other ways to look at this?
The issue in persistent was fixed by not rolling back on async exception, which is the right thing (except it should not roll back on sync exception too). I didn't check other issues, but I guess something similar should be done there. You are right that we should not assume that all users use pooling. My point is that most of them do, and adding query cancellation will make things even worse for them: instead of connection error (which they handle in some way or another anyway) they might get deadlocks (or long locks). In you application, where you don't use pooling, you are probably already handling connection errors, and (The table looks about right btw) |
I agree with @Yuras that we shouldn't be attempting to roll back on exceptions, and that adding query cancellation is a bad idea in general. If automatic query cancellation is added, I'd like to retain the existing behaviour as well. It seems that the main use-case here is using This might not be the right approach for this library, but we could also consider embedding connection pooling (via resource-pool) so that by default connections are pooled and automatically destroyed when an exception occurs. It might be hard to do this in a sensible and backwards compatible way, so maybe a separate library for combining resource-pool and postgresql-simple would make sense. (I know there are some libraries that technically do this already, but I'm thinking of something quite lightweight.) |
Sorry for taking so long to say something, I've had a busy few weeks. I certainly don't want to add either risks of incorrectness or non-interruptability if there are concerns about it. The exploration in this thread was full of new things to me, and my takeaway is that using
This sounds reasonable. Maybe we should just add
|
This allows us to avoid running queries inside exception handlers by running query canceling code before the next query is issued.
I have taken the liberty of implementing and pushing the new approach I suggested in #69 (comment). I thought it would be easier to see what I meant by pushing actual code. Please feel free to let me know if the new approach is inadequate, or if you want me to describe it or parts of it in more detail. Sorry to poke you again with this, but can @Yuras and @phadej take a look at the most recent commits? |
Looks good to me, definitely better then the initial design. I have a question though: what will happen if client is interrupted after sending the first part of a query, but before the rest is sent? I.e. query might be large enough to be split into multiple TCP packages, so server will receive the first part, but won't receive the reset. Will cancellation work in that case? Will it fail (which is OK as well)? The concern is that server (or libpq) will interpret this situation as there is no query in progress, and happily allow one to send another query, which will be interpreted by server as a continuation of the previous one, leading to a disaster. |
(To clarify: I think the design is much better because if you don't reuse connection on exception, then cancellation won't be performed, and nothing changes for you.) |
I honestly have no idea what could happen if a query is partially sent, or if it is sent in its entirety but some libpq control data isn't. At least, IIUC, this problem already exists in some form: any next query would also run into some problem, although I can't say how/if things would differ from query cancelation. I don't know much about Haskell's FFI, but ideally
This is way beyond my knowledge. I hope libpq/postgresql's protocol has control and data flows separable, so it can signal e.g. beginning-of-statement, among other things. But I know nothing about the protocol, so I'm only speculating. IIUC, these concerns aren't much aggravated by this PR, though? Also, I've made some changes to functions' signatures, such as It is very easy to revert them; the only loss is the nice error message tested in this test that I added during my exploration of the codebase, but the original issue this PR proposed to fix should still be fixed without it. So I'll revert them until (hopefully) the end of the week. The nice error message can come separately if we want it, I suppose. |
I've preserved functions' signatures and tested this branch against a personal project's test suite and also against my work's project's test suite (which is much larger than my personal project's) and they both passed. This is ready for review. |
Friendly ping of my last review request. I understand everyone is busy, so if there's something I can do to facilitate this, please let me know. |
I'm still interested in merging this, and willing to put in more work if necessary. @phadej do you think you or someone else would have time to review this? I can merge master into this or rebase, whatever your preference might be. |
I still haven't had time to test this with a project that makes extensive usage of postgresql-simple, but wrote a test that fails without this PR and passes with it.
I thought I'd push this because I see some activity on investigating this problem, and maybe this PR fixes it, so I hope it is helpful. I also hope to have more time to improve this any way we/you think would be good over the weekend.
Thoughts, corrections or even "you've got it all wrong" are very much appreciated; it's my first time contributing to a project like this.