
add support to batch metric and send on a strict interval #169

Closed

Conversation

maciuszek
Contributor

@maciuszek commented Dec 19, 2024

Add more definite send-interval control than what the flush interval provided, restricting when any metric is sent to this interval. This will implicitly introduce batching for timers.

@maciuszek marked this pull request as draft December 19, 2024 21:09
@maciuszek force-pushed the mattkuzminski/enforce-batching-for-all-metrics branch from 01f9e4d to 384f462 on December 20, 2024 00:51
@sokada1221 left a comment

A few non-binding comments, as I may not have full context/understanding:

  • I think it's generally a good idea to implement batching with 2 parameters: batch size and interval (see the sketch after this list). Since the interval is already configurable, it'd be nice to have the batch size configurable too - with the default being the magic approxMaxMemBytes/bufSize lol
  • How do you plan to test the change? Should we add some metrics/logs to verify the batching behavior?
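
For illustration, the size-plus-interval idea can be sketched as a small send loop: flush when the batch reaches the configured size, or when the interval fires, whichever comes first. This is a minimal standalone sketch with hypothetical names (runBatcher, flush), not the structure of this sink's run loop.

```go
package main

import (
	"fmt"
	"time"
)

// runBatcher is a hypothetical sketch: it flushes whenever the batch
// reaches batchSize or the interval elapses, whichever happens first.
func runBatcher(in <-chan string, batchSize int, interval time.Duration, flush func([]string)) {
	batch := make([]string, 0, batchSize)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case stat, ok := <-in:
			if !ok { // input closed: flush whatever is left and stop
				if len(batch) > 0 {
					flush(batch)
				}
				return
			}
			batch = append(batch, stat)
			if len(batch) >= batchSize { // size threshold reached
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C: // interval elapsed, send a possibly partial batch
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}

func main() {
	in := make(chan string)
	done := make(chan struct{})
	go func() {
		runBatcher(in, 3, 500*time.Millisecond, func(b []string) { fmt.Println("flush:", b) })
		close(done)
	}()
	for i := 0; i < 5; i++ {
		in <- fmt.Sprintf("stat-%d", i)
	}
	close(in)
	<-done
}
```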

	counter = sink.record
	if !strings.Contains(counter, expected) {
		t.Error("wanted counter value of test.___f=i:1|c, got", counter)
	}
	expected = "test.__host=i:1|c"
Contributor Author

@maciuszek Dec 23, 2024

Note: this test is out of scope for this work, but it was previously flaky because the order of reserved_tag vs test.__host was not deterministic.
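
One way to keep assertions like this order-independent is to check each expected tag fragment on its own. A minimal sketch, assuming a hypothetical helper (the name and package are illustrative, not part of the existing tests):

```go
package sketch

import (
	"strings"
	"testing"
)

// assertContainsAll checks each expected fragment independently, so the
// assertion no longer depends on the serialization order of the tags.
func assertContainsAll(t *testing.T, got string, wants ...string) {
	t.Helper()
	for _, want := range wants {
		if !strings.Contains(got, want) {
			t.Errorf("wanted %q to contain %q", got, want)
		}
	}
}
```

It could then be called as, for example, assertContainsAll(t, sink.record, "test.__host=i", ":1|c"), with the exact fragments depending on the test.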

net_sink.go Outdated
@@ -118,8 +118,8 @@ func NewNetSink(opts ...SinkOption) FlushableSink {
 		bufSize = defaultBufferSizeTCP
 	}
 
-	s.outc = make(chan *bytes.Buffer, approxMaxMemBytes/bufSize)
+	s.retryc = make(chan *bytes.Buffer, 1) // It should be okay to limit this given we preferentially process from this over outc.
+	s.outc = make(chan *bytes.Buffer, approxMaxMemBytes/bufSize) // todo: need to understand why/how this number was chosen and probably elevate it
Contributor Author

@maciuszek Dec 23, 2024

Forking #169 (review) to here to make threading possible, @sokada1221.

So this doesn't restrict the memory allocated, but the number of slots per metric/string.
If it's exceeded, it will block further stats from being written until some are read, so it wouldn't act as a batching mechanism 🤔. Buffered channels are a bit strange - I think in practice this buffer will always be full, and if we changed it to a normal channel with no buffer (always blocking), I don't think we would see an impact.
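
For context, here is a tiny standalone example of the blocking behaviour described above; the capacity and values are arbitrary and unrelated to approxMaxMemBytes/bufSize:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A buffered channel only provides capacity "slots": the first two sends
	// below return immediately, the third blocks until the receiver starts
	// draining. The buffer bounds pending items; it does not batch them.
	c := make(chan string, 2)

	go func() {
		time.Sleep(200 * time.Millisecond) // let the sender fill the buffer first
		for s := range c {
			fmt.Println("received:", s)
		}
	}()

	for i := 0; i < 3; i++ {
		start := time.Now()
		c <- fmt.Sprintf("stat-%d", i)
		fmt.Printf("send %d returned after %v\n", i, time.Since(start).Round(time.Millisecond))
	}
	close(c)
	time.Sleep(100 * time.Millisecond) // give the receiver time to print
}
```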

@maciuszek force-pushed the mattkuzminski/enforce-batching-for-all-metrics branch from 972503c to 5a6293c on December 24, 2024 19:58
net_sink.go Outdated
		}
	}
	return batch[:0]
Contributor

If there is an error in send, this will cause any following batched metrics to be dropped. It might be better to keep them queued. Additionally, this prevents us from putting any buffers received after the error back into the buffer pool (which is only a problem if we decide to keep the existing logic).

Contributor Author

@maciuszek Dec 30, 2024

If there's an error in send, the metrics will be written to the retryc channel, so nothing should be dropped I think, despite clearing them from the batch. In the current state, that retry handling escapes batching altogether; I'll think on this.

Never mind, I see that any failure prevents all subsequent sends in the current iteration. Good catch, thanks.
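
For the record, the behaviour being converged on here could look roughly like the following sketch: on a failed send, hand the unsent buffers to the retry queue instead of discarding them, then reset the batch so later metrics keep flowing. The function and its signature are illustrative, not the sink's actual API.

```go
package sketch

import "bytes"

// flushBatch is a hypothetical sketch of the failure path under discussion:
// if send fails, the unsent buffers are pushed onto the retry queue rather
// than dropped, and the batch is reset so subsequent metrics still flow.
func flushBatch(batch []*bytes.Buffer, send func([]*bytes.Buffer) error, retryc chan<- *bytes.Buffer) []*bytes.Buffer {
	if err := send(batch); err != nil {
		for _, buf := range batch {
			select {
			case retryc <- buf: // keep the data queued for a later attempt
			default:
				// retry queue is full; dropping here is the last resort
			}
		}
	}
	return batch[:0]
}
```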

@maciuszek force-pushed the mattkuzminski/enforce-batching-for-all-metrics branch from 76f2891 to c3b9a08 on December 30, 2024 15:21
@maciuszek force-pushed the mattkuzminski/enforce-batching-for-all-metrics branch from 9e755e1 to b7f0a3d on December 31, 2024 03:58
add more comments (will need to cleanup later)
fix bug with channel block when batching
@maciuszek marked this pull request as ready for review January 2, 2025 19:08
settings.go Outdated
@@ -20,6 +20,8 @@ const (
 	DefaultFlushIntervalS = 5
 	// DefaultLoggingSinkDisabled is the default behavior of logging sink suppression, default is false.
 	DefaultLoggingSinkDisabled = false
+	// DefaultBatchSize is the default maximum amount of stats we batch before sending, default is 0 which disables batching.

So we're batching based on a number of stats here instead of the size in MB? In that case I would think a batch size of 1 would be equivalent to no batching, not 0; a 0 batch size doesn't really make sense.

But it would be even better if we could explicitly enable/disable batching via config; I think that would make things clearer.

Contributor Author

Exactly, yeah - the batch size specifies the number of stats we batch.

A batch size of 1 will effectively be the same as no batching, but 0 specifically reverts back to the old behaviour, hence my use of it - as in, don't go through the batch code at all.

We could have 2 configurations for this, since I'm effectively using 0 like a feature flag anyway. But would we really ever clean it up? And having one configuration hard-dependent on the other isn't great either.
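
A minimal sketch of the dispatch being described (all names are hypothetical, not the sink's API): 0 bypasses the batching path entirely, so it doubles as a feature flag, while 1 still goes through it but flushes after every stat.

```go
package sketch

// startSender illustrates the 0-as-feature-flag semantics: a batch size of 0
// keeps the original, unbatched loop, while any positive value routes through
// the batching loop (a size of 1 still takes that path, flushing every stat).
func startSender(batchSize int, run func(), runBatched func(batchSize int)) {
	if batchSize == 0 {
		go run() // old behaviour: skip the batching code entirely
		return
	}
	go runBatched(batchSize)
}
```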

Contributor Author

Done. Made the configuration more granular

add test for batch send failure
remove some unnecessary comments
@maciuszek force-pushed the mattkuzminski/enforce-batching-for-all-metrics branch from 68e9ecb to 51ee282 on January 2, 2025 21:09
fix bug with batch interval
reclarify some comments
improved code structure and comments
@maciuszek changed the title from "support a strict flush interval for all metrics" to "add support to batch metric and send on a strict interval" Jan 3, 2025
net_sink.go Outdated
	batchTimeout := time.Duration(s.conf.BatchSendIntervalS) * time.Second
	batchInterval := time.After(batchTimeout)

	sendBatch := false

nit: rename to doSendBatch or shouldSendBatch

net_sink.go Outdated
		default:
			// Drop through in case retryc has nothing.
		}

		// send batched outc data anytime indicated or the batch is full
		if sendBatch || len(batch) >= batchSize {

The amount of nesting and conditions here is a bit hard to follow. Hopefully there's some way to refactor that would improve readability.

Contributor Author

Refactored the code a bit; I think it's more readable now.

The only alternative I can think of is forking another run function for batching and determining which one to use when launching the goroutine.

Doing that may indeed be more efficient, since there would be fewer conditions per loop, but the drawback is we'd have duplicate code that would be hard to maintain in 2 places 🤔

What are your thoughts? I could modularize some of the internals, but the structure would effectively be duplicated.
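
For what it's worth, the "modularize some of the internals" option could look roughly like this: keep a single run loop but pull each per-iteration decision into a small named helper so the select body stays flat. These helpers are purely illustrative; none of the names exist in net_sink.go.

```go
package sketch

import "bytes"

// shouldFlush reports whether the loop should send now: either it was told
// to (interval fired / flush requested) or the batch hit its size limit.
func shouldFlush(doSendBatch bool, batchLen, batchSize int) bool {
	return doSendBatch || batchLen >= batchSize
}

// appendToBatch adds one buffer and reports whether this append crossed the
// size threshold, so the caller's branch stays a single condition.
func appendToBatch(batch []*bytes.Buffer, buf *bytes.Buffer, batchSize int) ([]*bytes.Buffer, bool) {
	batch = append(batch, buf)
	return batch, len(batch) >= batchSize
}
```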

paulinjo previously approved these changes Jan 6, 2025
@maciuszek closed this Jan 7, 2025