Connection contention #356

Open
ltagliamonte opened this issue Jan 25, 2025 · 3 comments

ltagliamonte commented Jan 25, 2025

I was doing a bit of perf tuning, and by doubling the number of connections my latencies got much better.
I understand that connections are shared and that more connections is "better", but I'd rather have a way to monitor contention on the connections (goroutines waiting for a connection from the pool) than do guesswork, especially since the size of my worker pool is parameterized.

Is there a way to expose connection contention? Maybe a log enabled via config?

Thanks a lot for the great project!

mediocregopher (Owner) commented

Hi @ltagliamonte , can you clarify if you're using v3 or v4? The way the Pool works in each is extremely different, so it's not possible to answer without knowing which we're talking about.

ltagliamonte (Author) commented

Hello @mediocregopher, I'm using v4.

mediocregopher commented Jan 26, 2025

Nice, thanks. So in v4 this question is a bit tricky because there are potentially two places which could be blocking:

  1. Getting a Conn from the Pool. This can block if all Conns have been removed from the Pool and it's currently empty. A Conn is only removed from the Pool if the Action being performed is not shareable, which 99% of the time means it's a blocking command like BRPOP; otherwise the Conn is left in the Pool and shared with other shareable Actions. So for the Pool to be empty (and therefore blocking) you'd have to be performing more non-shareable Actions than there are Conns in the Pool.

If you want to know how many non-shareable Actions are taking place within your Pool, you could inspect it using a very simple wrapper:

import (
	"context"
	"sync/atomic"

	"github.com/mediocregopher/radix/v4"
)

type poolWrapper struct {
	radix.Client
	nonShareableActionsGauge atomic.Int64 // Int64 so it can be decremented below
}

func (pw *poolWrapper) Do(ctx context.Context, a radix.Action) error {
	if !a.Properties().CanShareConn {
		pw.nonShareableActionsGauge.Add(1)
		defer pw.nonShareableActionsGauge.Add(-1)
		// Or however you want to measure it
	}
	return pw.Client.Do(ctx, a)
}

// Spin up a go-routine to periodically log nonShareableActionsGauge
  2. For Actions which are shareable, their EncodeDecode calls will be automatically pipelined within the Conn. In effect, any blocking which happens at this stage is a result of network congestion, where either the time it takes to write to the socket or to read responses back from it is preventing subsequent Actions from having their turn. If you want to know how many Actions are blocked at this stage you could essentially do the opposite of the example above: increment a counter for every active shareable Action. Dividing that by the Pool size gives you roughly the current number of Actions which are blocked per Conn (a rough sketch follows below).
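
A minimal sketch of that second counter, using the same imports as the wrapper above; the inFlightShareable / poolSize names and the perConnLoad helper are just placeholders:

type shareableTracker struct {
	radix.Client
	inFlightShareable atomic.Int64
	poolSize          int64 // whatever size you configured the Pool with
}

func (st *shareableTracker) Do(ctx context.Context, a radix.Action) error {
	if a.Properties().CanShareConn {
		st.inFlightShareable.Add(1)
		defer st.inFlightShareable.Add(-1)
	}
	return st.Client.Do(ctx, a)
}

// perConnLoad returns roughly how many shareable Actions are currently
// in flight per Conn.
func (st *shareableTracker) perConnLoad() float64 {
	return float64(st.inFlightShareable.Load()) / float64(st.poolSize)
}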

What you asked for, a log message like "Action is blocked because the Pool is too small", is unfortunately not something which can easily be determined, because all Actions block for some amount of time; the only question is how long is acceptable. If you're using a metrics server like Prometheus then a wrapper like the above can be a great place to record Action times on a histogram (a rough example follows), and once the time it takes to Do an Action has gotten too high you increase the Pool size some more. If you're not using Prometheus you could use an in-memory histogram library to the same effect.
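
For example, a timing wrapper along these lines would record per-Action latency to a Prometheus histogram; the metric name and buckets are only placeholders:

import (
	"context"
	"time"

	"github.com/mediocregopher/radix/v4"
	"github.com/prometheus/client_golang/prometheus"
)

var actionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "radix_action_duration_seconds",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(actionDuration)
}

type timingWrapper struct {
	radix.Client
}

func (tw *timingWrapper) Do(ctx context.Context, a radix.Action) error {
	start := time.Now()
	err := tw.Client.Do(ctx, a)
	actionDuration.Observe(time.Since(start).Seconds())
	return err
}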

One final note, which doesn't answer your question but might help, is to check out the WriteFlushInterval field of the Dialer if you haven't yet. By setting that to something like 150 microseconds you can increase the overall throughput of Conns, as it will reduce the number of system calls being made even further.
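
For reference, setting that when constructing the Pool looks roughly like this (if I'm remembering the v4 constructor correctly; the address is a placeholder and ctx comes from wherever you do your setup):

cfg := radix.PoolConfig{
	Dialer: radix.Dialer{
		WriteFlushInterval: 150 * time.Microsecond,
	},
}
client, err := cfg.New(ctx, "tcp", "127.0.0.1:6379")
if err != nil {
	// handle the error
}
defer client.Close()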
