Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

bgentry · 2023-06-13T22:43:39Z

Describe the bug
After releasing all conns back to a pgxpool.Pool and closing the pool, there can still be goroutines left running. That's because triggerHealthCheck gets called upon Release(), and it spawns a goroutine which begins by sleeping 500ms no matter what.

This means there is no way to guarantee that a pgxpool is fully shut down with no remaining remnants, at least not without sleeping a non-trivial amount of time.

To Reproduce

I can try to put together a reproduction if it helps. I believe this is triggered by:

Running goleak in your test program
Starting up a pgxpool
Acquiring a conn
Releasing the conn
Closing the pool
Observing that goleak sometimes detects open/leaked goroutines when the program exits.

Expected behavior
Closing a pgxpool.Pool should leave behind no remnants and no open goroutines.

Actual behavior
Because there is a hardcoded 500ms sleep that doesn't terminate when i.e. the pool closes sooner, goroutines are sometimes left running after the pool is closed.

Version

Go: $ go version -> go version go1.20.4 darwin/amd64
PostgreSQL: $ psql --no-psqlrc --tuples-only -c 'select version()' -> PostgreSQL 14.8 (Homebrew) on x86_64-apple-darwin22.4.0, compiled by Apple clang version 14.0.3 (clang-1403.0.22.14.1), 64-bit
pgx: $ grep 'github.com/jackc/pgx/v[0-9]' go.mod -> github.com/jackc/pgx/v5 v5.3.1

Additional context
I did not see any past issues discussing this, so I wanted to at least confirm that it was an intentional design choice and not merely an oversight or bug.

The text was updated successfully, but these errors were encountered:

jackc · 2023-06-14T13:14:21Z

Because there is a hardcoded 500ms sleep that doesn't terminate when i.e. the pool closes sooner, goroutines are sometimes left running after the pool is closed.

But the goroutine terminates after that 500ms, right? Does this behavior cause a problem?

bgentry · 2023-06-14T14:09:22Z

@jackc I think the primary issues are:

This goes against the expectations a Go programmer has that when they close a resource that it is fully closed. This expectation is IMO strengthened by the docs for Pool.Close() which imply that all resources have been released by the time the method returns.

When using goleak in test suites to verify that no goroutines are being leaked by the code being tested, pgx generates false positives with one "leaked" goroutine per conn pool that was recently closed:

goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 21898 in state sleep, with time.Sleep on top of the stack:
goroutine 21898 [sleep]:
time.Sleep(0x1dcd6500)
	/opt/hostedtoolcache/go/1.20.4/x64/src/runtime/time.go:195 +0x135
github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck.func1()
	/home/runner/go/pkg/mod/github.com/jackc/pgx/[email protected]/pgxpool/pool.go:379 +0x[31]
created by github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck
	/home/runner/go/pkg/mod/github.com/jackc/pgx/[email protected]/pgxpool/pool.go:376 +0x8f

 Goroutine 21883 in state sleep, with time.Sleep on top of the stack:
goroutine 21883 [sleep]:
time.Sleep(0x1dcd6500)
	/opt/hostedtoolcache/go/1.20.4/x64/src/runtime/time.go:195 +0x135
github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck.func1()
	/home/runner/go/pkg/mod/github.com/jackc/pgx/[email protected]/pgxpool/pool.go:379 +0x31
created by github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck
	/home/runner/go/pkg/mod/github.com/jackc/pgx/[email protected]/pgxpool/pool.go:[37]

In order to avoid these false positives, a user of both pgx and goleak needs to add (a) a 500ms+ sleep at the end of their tests, or (b) this exception when invoking goleak:

var knownGoroutineLeaks = []goleak.Option{ //nolint:gochecknoglobals
	// This goroutine contains a 500 ms uninterruptable sleep that may still be
	// running by the time the test suite finishes and cause a failure.
	goleak.IgnoreTopFunction("github.com/jackc/pgx/v5/pgxpool.(*Pool).backgroundHealthCheck"),
}

This exception is also prone to breaking due to pgx internal refactors, so it's a hacky solution.

From a brief look at the code in pool.go, I'm wondering if instead of an uninterruptible time.Sleep(500*time.Millisecond), the goroutine in triggerHealthCheck() could be refactored to use a timer that is properly canceled & drained when that goroutine exits, and then select both that timer and p.closeChan? Something like:

t := time.NewTimer(500*time.Millisecond)

select {
case <-p.closeChan:
	if !t.Stop() {
		<-t.C
	}
	return
case <-t.C:
	select {
	case p.healthCheckChan <- struct{}{}:
	default:
	}
}

This of course doesn't 100% solve the fact that the goroutine could still be running for a brief moment after the return of Pool.Close(); that would probably require a deeper refactor, particularly a way to confirm that this and any other goroutines have exited.

I also see that elsewhere in the code you're doing something similar to the above pattern, except using a <-time.After(). This is another place where a goroutine can be left in a leaked state for up to 500ms. This could also benefit from the above pattern, or you could extract it into a slightly cleaner form with something like this: https://go.dev/play/p/pH0jHo2RZaL

jackc · 2023-06-14T14:40:18Z

🤷‍♂️

This goes against the expectations a Go programmer has that when they close a resource that it is fully closed. This expectation is IMO strengthened by the docs for Pool.Close() which imply that all resources have been released by the time the method returns.

The docs say all connections are closed, and they are. The fact that health checks are implemented via a background goroutine is an implementation detail that isn't exposed through the public interface.

When using goleak in test suites to verify that no goroutines are being leaked by the code being tested, pgx generates false positives with one "leaked" goroutine per conn pool that was recently closed:

That is unfortunate, but I view it as akin to checking memory usage as soon as a value has gone out of scope. The GC may not have run yet. Once a variable is out of scope or a Pool is closed it's not visible to the user application unless runtime introspection is used.

I don't mind if someone wants to refactor / improve this, but it's not something I personally plan on working on.

bgentry · 2023-06-14T15:10:06Z

@jackc I opened #1642 to at least try to improve these situations, although it doesn't fully resolve the potential goleak issue. At least the workaround is documented here in case anybody goes looking for it.

If in spite of the PR you don't feel this is worth fixing, no worries, feel free to close this and the PR. Thanks for your great work on pgx regardless ✌️

When a pool is closed, some background goroutines may be left open, particularly for health checks as detailed in jackc#1641. Two specific examples have been refactored here to avoid a blocking sleep and instead also select on the pool being closed to potentially return/continue sooner.

bgentry added the bug label Jun 13, 2023

bgentry mentioned this issue Jun 14, 2023

Faster shutdown of pgxpool.Pool background goroutines #1642

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

bgentry commented Jun 13, 2023 •

edited

Loading

jackc commented Jun 14, 2023

bgentry commented Jun 14, 2023

jackc commented Jun 14, 2023

bgentry commented Jun 14, 2023

Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

Comments

bgentry commented Jun 13, 2023 • edited Loading

jackc commented Jun 14, 2023

bgentry commented Jun 14, 2023

jackc commented Jun 14, 2023

bgentry commented Jun 14, 2023

bgentry commented Jun 13, 2023 •

edited

Loading