Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect stuck TCP connections #2163

Open
redbaron opened this issue Nov 12, 2024 · 5 comments
Open

Detect stuck TCP connections #2163

redbaron opened this issue Nov 12, 2024 · 5 comments

Comments

@redbaron
Copy link
Contributor

redbaron commented Nov 12, 2024

Is your feature request related to a problem? Please describe.

When we failover PostgreSQL server to another primary, old connections can remain stuck: for whatever reason they are not closed and attempt to use them has end up with timeout:

write failed: write tcp 10.32.14.13:36102->10.36.54.61:5432: i/o timeout

we see these entries minutes after failover with all connections in connection pool

Describe the solution you'd like

It seems that detecting hung TCP connection is not trivial. TCP keepalive won't help, because with outstanding writes connection is marked as active and no keepalives are set.

One way I can think of is if TCP IO timeout was somehow decoupled from Query context timeout. net.Conn has SetWriteDeadline(t time.Time) error method which on the surface should do the trick: configurable short timeout of 1-3 seconds will be enough to detect IO error and take connection out of the pool still allowing Query context have arbitrary timeout.

Describe alternatives you've considered

Have custom Dialer which sets TCP_USER_TIMEOUT (not unlike tcp_user_timeout connection string param in libpq) with raw sys calls, but it is not portable.

Another options is pgxpool connection health check. It seems to check connection health on Acquire, which underneath executes -- ping statement, but context passed for that check is from Acquire call which in turn itself is from pgxpool.Query so it suffers from the same problem of stuck connection: context timeout is tuned for query execution time overall. If health check was executed with a smaller context timeout derived from parent context then it would detect problematic connections more reliably. This is not a preferred option because it doesn't handle stuck connection detection for already Acquired connections.

@jackc
Copy link
Owner

jackc commented Nov 16, 2024

As you say, detecting broken TCP connections is non-trivial.

I wonder if a write timeout would even work reliably? As I understand it, from Go's point of view, the write is successful when sent (maybe even only in the OS write buffer), not when acked in some manner. I think a write can succeed even when the underlying connection is broken. That's why the pool health check uses a -- ping statement. AFAIK, nothing lower level is sufficient to guarantee the connection is live.

@redbaron
Copy link
Contributor Author

I wonder if a write timeout would even work reliably?

I believe it would cover at least our case. Error line seems to failing on write and error itself seems to be from net stack, lowering write timeout should do the trick. Detecting stuck connection in some of the cases is an improvement still.

if going with this approach, we need to be careful with large batch writes, which I suspect do expect to be blocked on write for long.

That's why the pool health check

I think pool healtcheck should use derived context with much shorter timeout regardless of decision made how to proceed with stuck connection detection. Probably that timeout should be configurable, because it differs whether connection is to PG or pgxpool . Our systems use direct connection to PG, so I'd configure it to 1s.

As for the purpose of this issue, pool health check won't help us much: all connections are in constant reuse and pool health check triggers on 1s of inactivity. If TCP connection becomes "stuck" only small subset of pool connections will be lucky enough to hit it at right moment and right state for pool check to trigger AND to catch issue with connection.

@jackc
Copy link
Owner

jackc commented Nov 29, 2024

I'm open to some new setting for a write timeout. Not sure exactly what the interface would be though. Not sure how directly setting SetWriteDeadline would interact with the existing context system.

@jackc
Copy link
Owner

jackc commented Dec 6, 2024

See #2185. I proposed something there that might have some overlap with this.

@redbaron
Copy link
Contributor Author

redbaron commented Dec 6, 2024

Context is useful when app should be able to dynamically pass piece of configuration or state down the call stack. Either health check ping length or write deadline look fairly static params to me, so just a config option will suffice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants