Prevent indefinite unsent metrics loop in outputs.azure_monitor #15908

Hr0bar · 2024-09-17T10:13:19Z

Use Case

Hi, we have been using outputs.azure_monitor for more than 3 years and only now ran into the issue that triggered this feature request. A global option to drop unsent metrics by a time limit instead of a buffer size limit could be a solution, or an output-plugin-specific solution could be added?

The issue we experienced:

For more than 30 minutes the Azure endpoint was timing out:

Since around Sep 13 07:00:00:
Sep 13 07:47:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:47:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:48:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:48:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:49:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:49:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:50:19 HOSTNAME.com telegraf[1305]: 2024-09-13T07:50:19Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 13 07:51:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:51:09Z E! [agent] Error writing to outputs.azure_monitor: Post "https://northeurope.monitoring.azure.com/subscriptions/CENSORED/metrics": read tcp 172.CENSORED.CENSORED.CENSORED:43962->20.50.65.82:443: read: connection timed out

Then the Azure endpoint became responsive again, but since it had been down for more than 30 minutes, all metrics Telegraf had buffered were now rejected by Azure, resulting in an indefinite loop of never-sent metrics:

Sep 13 07:52:10 HOSTNAME.com telegraf[1305]: 2024-09-13T07:52:10Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:53:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:53:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:54:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:54:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
Sep 13 07:55:09 HOSTNAME.com telegraf[1305]: 2024-09-13T07:55:09Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}

...forever, until Telegraf is restarted.

For now we will drastically reduce the buffer size limit to avoid this. But we add/remove metrics to monitoring all the time, so it's not feasible to recalculate proper buffer limits (metric_batch_size and metric_buffer_limit) every time to achieve the desired time limit for unsent metrics before they are dropped, so this solution doesn't seem very "clean"?
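To illustrate why sizing the buffer by hand is brittle, here is a rough back-of-the-envelope sketch in Python. It assumes a steady number of metrics per collection interval and that the oldest metrics are overwritten once the buffer is full. The function name and the example figures are hypothetical and not part of Telegraf, and the Azure output aggregates metrics before buffering, so real counts will differ.

```python
import math


def buffer_limit_for_retention(metrics_per_interval: int,
                               interval_s: float,
                               max_age_s: float,
                               batch_size: int) -> int:
    """Estimate a metric_buffer_limit that holds roughly max_age_s of metrics.

    Assumes a steady metric rate. The result is rounded down to a multiple of
    batch_size and never goes below 2 * batch_size, following the constraints
    stated in the Telegraf source comment.
    """
    intervals = max_age_s / interval_s
    raw = metrics_per_interval * intervals
    batches = max(2, math.floor(raw / batch_size))
    return batches * batch_size


# Hypothetical numbers: 120 metrics per 60 s interval, 30 min retention.
print(buffer_limit_for_retention(120, 60, 30 * 60, 100))
# Adding 30 more metrics per interval already changes the "right" limit.
print(buffer_limit_for_retention(150, 60, 30 * 60, 100))
```

Every change to the set of collected metrics shifts the result, which is the recalculation burden described above.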

One extra note on documentation: we had to look at the Telegraf source code comments to find out that the oldest metrics are overwritten. This is important for us to know, because it means that reducing the buffer size will help us in most cases:

	// MetricBufferLimit is the max number of metrics that each output plugin
	// will cache. The buffer is cleared when a successful write occurs. When
	// full, the oldest metrics will be overwritten. This number should be a
	// multiple of MetricBatchSize. Due to current implementation, this could
	// not be less than 2 times MetricBatchSize.

The official docs say just:

metric_buffer_limit: Maximum number of unwritten metrics per output. Increasing this value allows for longer periods of output downtime without dropping metrics at the cost of higher maximum memory usage.

This is not enough to make an educated configuration decision; perhaps it could be made more verbose?

Expected behavior

A reliable way to configure dropping unsent metrics by time instead of by buffer size. We add/remove metrics to monitoring all the time, so it's not feasible to recalculate proper buffer limits (metric_batch_size and metric_buffer_limit) every time to achieve the desired time limit for unsent metrics before they are dropped.

Alternatively, a force drop of old metrics could occur when
E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}

is detected in the Azure output plugin, as a plugin-specific solution.

Alternatively, perhaps a documentation update in the Azure output plugin would be enough, warning all users that aggressive buffer sizes should be used to avoid this issue. The rationale is that a global option to drop by time limit could be overkill or not warranted enough, but that is for the Telegraf project to decide :)

Actual behavior

Azure Monitor, and possibly other output destinations, rejects metrics older than a certain time limit. When network or other issues prevent outputs from being sent, Telegraf keeps metrics older than that limit in the buffer indefinitely: the expired metrics are scheduled to be sent first (at the top of the stack), get rejected, and are retried forever.

Additional info

Similar to #13273


Hr0bar · 2024-09-17T10:29:16Z

Actually just found #14928 where it says:

Currently, when a Telegraf output metric queue fills, either due to incoming
metrics being too fast or various issues with writing to the output, new
metrics are dropped and never written to the output.

While in code comments we see this:

	// MetricBufferLimit is the max number of metrics that each output plugin
	// will cache. The buffer is cleared when a successful write occurs. When
	// full, the oldest metrics will be overwritten. This number should be a
	// multiple of MetricBatchSize. Due to current implementation, this could
	// not be less than 2 times MetricBatchSize.

I'm not so sure anymore that the workaround with an aggressive buffer limit will help, since it's not clear whether old metrics are dropped, or old metrics are kept and new ones dropped, in which case an aggressive buffer limit won't help.

srebhan · 2024-10-02T15:15:48Z

The code comment is correct and the spec needs correction. Are you willing to put up a PR to fix the spec wording? I can confirm that actually the oldest metrics in the buffer are dropped (overwritten by new metrics) first.
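As a toy illustration of the behaviour confirmed above (oldest metrics overwritten first), a bounded deque in Python behaves the same way. This is only a model of the semantics under that assumption, not Telegraf's actual buffer implementation, and the helper name is made up.

```python
from collections import deque


def fill(limit, metrics):
    """Push metrics into a bounded buffer that drops the oldest entry when full."""
    buf = deque(maxlen=limit)
    for m in metrics:
        buf.append(m)  # when full, the leftmost (oldest) item is discarded
    return list(buf)


# Five metrics into a buffer of three: metrics 1 and 2 are overwritten.
print(fill(3, [1, 2, 3, 4, 5]))
```

With this behaviour, a smaller buffer limit does bound how old the retried metrics can get.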

srebhan · 2024-10-02T15:16:41Z

Regarding your issue, I'm preparing a new way of selectively dropping metrics on outputs and I want to hold this up until this is ready. Is that OK for you?

Hr0bar · 2024-10-02T17:43:47Z

Thanks, yes, we are fine waiting for a more comprehensive solution. I'll have a look at a PR for the wording in the spec (and possibly also expanding the wording of the "metric_buffer_limit" description in the docs; perhaps even the Azure output plugin docs could mention the Azure limit, as some Azure users may not be aware of it).

srebhan · 2024-10-16T18:35:36Z

@Hr0bar would love to get your feedback on PR #16034!

Hr0bar · 2024-10-18T08:33:16Z

Well, that's one way to solve it: by dropping only the rejected metrics. That would certainly work for this use case as well (rather than relying on a time-based drop).

But... I would assume this could be difficult to generalize across many output plugins, as I imagine some external systems (in this case Azure) reject the whole batch, some provide info on what specifically they rejected and why, some don't, etc.

No idea how the Azure Monitor output behaves, for example... Is it sent metric by metric, or as a full batch? What happens when only one metric in a full batch is rejected? Do we drop the whole batch (I don't think so)? How do we know which metric caused the service to reject the batch if it's only one out of many?

srebhan · 2024-10-18T11:01:41Z

Did you check the spec? It extends the current batch handling to be able to do exactly this, drop specific metrics of a batch or drop the whole batch or do whatever the output requires. Currently, you can either accept a batch (removing it from the buffer) or keep the batch (requeueing the whole thing on next write)...
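The three outcomes mentioned above (accept the batch, keep the batch, or drop specific metrics) can be sketched as follows. This is a simplified Python model with hypothetical names; it is not the API defined in the spec or in Telegraf.

```python
from dataclasses import dataclass, field


@dataclass
class WriteResult:
    """Hypothetical per-batch result reported by an output."""
    accepted: list = field(default_factory=list)  # indices written successfully
    rejected: list = field(default_factory=list)  # indices to drop for good


def apply_result(batch, result):
    """Return the metrics that stay in the buffer for the next write."""
    done = set(result.accepted) | set(result.rejected)
    return [m for i, m in enumerate(batch) if i not in done]


batch = ["a", "b", "c", "d"]
# Keep the whole batch (plain write error): everything is retried.
print(apply_result(batch, WriteResult()))
# Partial write: "a" and "c" went through, "b" was rejected, "d" is retried.
print(apply_result(batch, WriteResult(accepted=[0, 2], rejected=[1])))
# Reject the whole batch: nothing is left to retry.
print(apply_result(batch, WriteResult(rejected=[0, 1, 2, 3])))
```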

Hr0bar · 2024-10-21T06:05:51Z

Yes, understood that, and yes that would be a working solution :)

I was just thinking about how that solution could be achieved, as someone with no understanding of how it currently works. I was under the impression that a whole batch is currently sent to the output destination as a single payload, and was wondering whether, if that is the case, it would need to be split and sent metric by metric instead, so we know exactly which metrics get rejected and can drop only those rather than the whole batch.

srebhan · 2024-11-06T17:30:52Z

@Hr0bar I'm looking into this issue currently and I wonder how we know what the filter time should be? In your error message the valid time-range is 30 min ago and 4 min in the future but I cannot find that anywhere in the Azure documentation. Should this be a config option?

Hr0bar · 2024-11-07T06:07:05Z

Found this: https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-store-custom-rest-api "Each data point sent to Azure Monitor must be marked with a timestamp. This timestamp captures the date and time at which the metric value is measured or collected. Azure Monitor accepts metric data with timestamps as far as 20 minutes in the past and 5 minutes in the future. The timestamp must be in ISO 8601 format." So seems inconsistent on Microsoft side.


srebhan · 2024-11-11T10:13:44Z

So I think making this a config option is the right thing to do then with 20 min in the past and 4 min in the future (i.e. the intersection of both time-sets ;-)). Parsing this from the response might be another option but I do worry that this will be a fragile foot-gun...
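The "intersection of both time-sets" can be written out explicitly. The sketch below is Python with made-up names; the two windows are the ones quoted in this thread (30 min / 4 min from the error message, 20 min / 5 min from the Microsoft documentation), and whether either is authoritative remains an open question.

```python
from datetime import datetime, timedelta, timezone

# (maximum age in the past, maximum distance into the future)
ERROR_MESSAGE_WINDOW = (timedelta(minutes=30), timedelta(minutes=4))
DOCUMENTED_WINDOW = (timedelta(minutes=20), timedelta(minutes=5))


def intersect(a, b):
    """Most conservative window that satisfies both limits."""
    return (min(a[0], b[0]), min(a[1], b[1]))


def within_window(ts, now, window):
    """True if timestamp ts is acceptable at time now for the given window."""
    past, future = window
    return now - past <= ts <= now + future


window = intersect(ERROR_MESSAGE_WINDOW, DOCUMENTED_WINDOW)
now = datetime(2024, 11, 11, 10, 0, tzinfo=timezone.utc)
print(window)  # 20 minutes in the past, 4 minutes in the future
print(within_window(now - timedelta(minutes=25), now, window))
```

A metric that is 25 minutes old passes the limit from the error message but fails the conservative intersection.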

Hr0bar · 2024-11-11T10:42:49Z

Agreed, it's likely not stable on the Azure side (both the values and the possible error message string), so making it configurable with a conservative intersection as the default makes sense!

srebhan · 2025-01-29T19:24:04Z

@Hr0bar sorry for this taking so long! Please test the binary in PR #16448, available once CI finished the tests, and let me know if this fixes the issue!

Hr0bar · 2025-01-30T07:26:54Z

Thanks!

I can try the binary, but I don't think I can simulate the Azure (or network proxy etc.) outages reliably enough to reproduce the issue for a test. Since I filed this issue we have experienced it zero times. Azure was timing out once for about 45 minutes, but since some batches (about every tenth) went through and reduced the backlog, even that did not trigger the infinite cycle of unsent metrics.

This would likely be best quick-tested with a mock setup using an endpoint other than Azure.

There is one typo in the PR docs: "witin" should be "within".

I also see the more relaxed 30 min limit was chosen as the default for past metrics. Perhaps we could get an official answer from Microsoft instead of guessing whether the limit in the returned error message or the one in their documentation is correct. But I would indeed trust the error message more than the docs.

Anyway, the binary starts but fails after loading azure monitor plugin start for us:

Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7d4f6e]
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: goroutine 1 [running]:
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: net/http.(*Client).do(0x0, 0xc001e93680)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /usr/local/go/src/net/http/client.go:606 +0x1ee
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: net/http.(*Client).Do(...)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /usr/local/go/src/net/http/client.go:590
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/plugins/outputs/azure_monitor.vmInstanceMetadata(0x0)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/plugins/outputs/azure_monitor/azure_monitor.go:443 +0x205
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/plugins/outputs/azure_monitor.(*AzureMonitor).Connect(0xc002676b40)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/plugins/outputs/azure_monitor/azure_monitor.go:98 +0x46
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/models.(*RunningOutput).Connect(0xc0026a2840)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/models/running_output.go:170 +0x28
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/agent.(*Agent).connectOutput(0x0?, {0xae3d150, 0xc002690be0}, 0xc0026a2840)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/agent/agent.go:811 +0x15f
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/agent.(*Agent).startOutputs(0xc001d5c598, {0xae3d150, 0xc002690be0}, {0xc0026b00a8, 0x1, 0x0?})
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/agent/agent.go:787 +0xf0
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/influxdata/telegraf/agent.(*Agent).Run(0xc001d5c598, {0xae3d150, 0xc002690be0})
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/agent/agent.go:140 +0x57a
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.(*Telegraf).runAgent(0xc001acfa20, {0xae3d150, 0xc002690be0}, 0x0?)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:475 +0x19a5
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.(*Telegraf).reloadLoop(0xc001acfa20)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:206 +0x26b
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.(*Telegraf).Run(0xc001acfa20)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf_posix.go:19 +0xbc
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.runApp.func1(0xc001d5b880)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/main.go:256 +0xd26
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/urfave/cli/v2.(*Command).Run(0xc0026a1760, 0xc001d5b880, {0xc0000b4050, 0x5, 0x5})
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:276 +0x7e2
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/urfave/cli/v2.(*App).RunContext(0xc00207aa00, {0xae3ceb0, 0x117d1f80}, {0xc0000b4050, 0x5, 0x5})
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:333 +0x58b
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: github.com/urfave/cli/v2.(*App).Run(...)
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:307
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.runApp({0xc0000b4050, 0x5, 0x5}, {0xad8f1c0, 0xc0000e2028}, {0xadc4cc0, 0xc0000e3c20}, {0xadc4ce8, 0xc002680180}, {0xae3cd60, ...})
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/main.go:400 +0x1131
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]: main.main()
Jan 30 07:21:32 HOSTNAME.com telegraf[1198730]:         /go/src/github.com/influxdata/telegraf/cmd/telegraf/main.go:414 +0xe8

srebhan · 2025-01-30T08:03:46Z

Pushed a fix for the panic above. Sorry for not getting it right in the first place...

For simulation of the issue, you could simply increase the timestamp_limit_past to more than 30 minutes and inject an old metric into telegraf (e.g. using the file input)...

Hr0bar · 2025-01-30T09:03:30Z

I cannot trigger it with the old version (without this PR) and the file/tail plugin. It seems the plugin already filters out metrics that are too old and drops them (it does not aggregate them), so the issue only happens when the plugin accepts and aggregates them, fails to write, and then retries sending them after too long a time?

Tested the same with the new PR build (with the panic fixed), and the same thing happens: metrics that are too old are not even attempted to be sent to Azure. Only current metrics are aggregated and sent (to the output buffer).

It seems some protections were already in place. So to test, I think I need to provide CURRENT timestamps on input and simulate a write error/timeout for long enough that the CURRENT timestamps become too OLD. Otherwise the existing protections will refuse them at the very beginning.

Correct me if I understood it wrongly.

Hr0bar · 2025-01-30T09:08:09Z

Ahh, I'll try setting timestamp_limit_past with the new PR to something like 1000 days in the past and test with that. I did not realize the new PR could also modify the existing protections so that they are not triggered. Will try in a sec.

Hr0bar · 2025-01-30T09:19:36Z

Okay, I get the:

2025-01-30T09:13:37Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}

when I set timestamp_limit_past far enough into the past and provide metrics that are too old on input.

It doesn't happen when it is set to the default of 30 minutes.

But with the default, the old metrics are filtered out by logic that was already present, and are not even attempted to be sent. So I'm not sure this test helps in any way :/ The same would happen in the old versions, and the issue is only triggered by the network issues mentioned above (metrics not filtered straight away, but buffered for retry until too much time passes).

Is there a way I can load current timestamps into the output buffer, so the plugin doesn't filter them straight away but only sends them after 30 minutes? With some high flush_interval or similar?

srebhan · 2025-01-30T12:34:14Z

@Hr0bar well the problem is the following: Previously, the metrics were aggregated and the aggregates were filtered by the time limit just before being pushed to the buffer, i.e. aggregated metrics older than 30 minutes were dropped and metrics newer than 1 minute in the past were held back. All aggregated metrics within that timespan were added to the buffer and written. If that write fails (e.g. due to a connectivity issue), the metrics are still in the buffer and Telegraf will attempt to write them again in the next flush cycle because the write returned an error.
Now assume flush cycles of 1 minute and an (aggregated) metric that is exactly 30 minutes old. At the first flush interval, the aggregated metric is "pushed" to the buffer as it is valid in terms of time. However, assume the write fails (e.g. due to connectivity or service availability); then the metric will stay in the buffer, scheduled for the next write attempt. So at the next flush interval the metric is 31 minutes old, but as there is no filtering it will be sent to Azure and the service returns the 400 error you see. As the Telegraf output model sees a write error, it will keep the metric in the buffer for the next write. This continues eternally until the buffer is overflowing...

My change does two things. First, it checks the metrics before each write, so the 31-minute-old metric will be rejected and in turn dropped from the buffer. However, there is still a (small) chance of getting a 400 error from the service, e.g. if the metric is exactly 30 min old when it is sent, network latency will make it older than 30 min on arrival. Second, to also cover this case, I implemented special handling for the 400 error which rejects all metrics sent in the batch and thus drops them from the buffer, so they won't trigger the same issue in the next write.
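The failure mode and the two-part change described above can be played through with a small simulation. This is a deliberately simplified Python model with invented names: timestamps are whole minutes, one metric arrives per flush, the fake service applies the 30 minute limit from the error message, and aggregation is ignored. It is not the plugin's code.

```python
LIMIT_PAST_MIN = 30  # service-side limit in minutes, per the error message


def service_write(batch, now, service_up):
    """Fake endpoint: returns 'ok', 'timeout' or 'bad_request'."""
    if not service_up:
        return "timeout"
    if any(now - ts > LIMIT_PAST_MIN for ts in batch):
        return "bad_request"
    return "ok"


def flush(buffer, now, service_up, fixed, local_limit=LIMIT_PAST_MIN):
    """Run one flush cycle and return the metrics left in the buffer."""
    if fixed:
        # Change 1: re-check the metric age right before every write.
        buffer = [ts for ts in buffer if now - ts <= local_limit]
    if not buffer:
        return buffer
    status = service_write(buffer, now, service_up)
    if status == "ok":
        return []
    if status == "bad_request" and fixed:
        # Change 2: a 400 rejects the whole batch, so it is dropped.
        return []
    return buffer  # keep everything for the next flush (retry)


def simulate(fixed, outage_minutes=35, total_minutes=45):
    """One new metric and one flush per minute; returns the final buffer size."""
    buffer = []
    for now in range(total_minutes):
        buffer.append(now)  # metric timestamp in minutes
        buffer = flush(buffer, now, now >= outage_minutes, fixed)
    return len(buffer)


print(simulate(fixed=False))  # buffer keeps growing after the outage ends
print(simulate(fixed=True))   # buffer drains once the service is back
```

Calling `flush([0], 60, True, fixed=True, local_limit=120)` models the first test case listed below: the local check lets the 60-minute-old metric through, the service answers with a 400, and the batch is dropped instead of being retried.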

For testing, there are two cases I would be interested in:

1. Check if the special handling of the 400 error is correct. This can be tested by injecting an old metric with the time safeguards disabled (e.g. by making the metric 60 min old and setting timestamp_limit_past to 120 min) so the metric gets written to the service. This will cause a 400 error, but you should see the metric buffer **not** getting filled over time.
2. Check if the time-filtering on write works correctly. To check this you would need to simulate a service outage or connectivity issue to Azure.

For the second test I can write a unit-test (and actually have done so). But the first test can only be performed by you as it involves the actual Azure service... So please

1. Set timestamp_limit_past = "120m"
2. Generate an old metric with a timestamp less than 120 min and more than 30 min in the past (e.g. 60m). This can be done using starlark or by reading a handcrafted metric with the tail plugin.
3. Check the metric buffer fullness when running Telegraf in debug mode. The buffer fullness should not increase but stay constant or drop.
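For step 2, a handcrafted metric with an old timestamp can be produced with a few lines of Python. This is just one way to do it: the measurement, tag and field names are placeholders, and the line uses the influx line protocol with a nanosecond timestamp.

```python
import time


def old_metric_line(age_minutes, now_s=None):
    """Build an influx line-protocol metric timestamped age_minutes in the past."""
    now_s = time.time() if now_s is None else now_s
    # Line protocol timestamps are nanoseconds since the Unix epoch by default.
    ts_ns = int(now_s - age_minutes * 60) * 1_000_000_000
    return f"testplugin,test=test0 usage_active=11.9 {ts_ns}"


# A metric that is 60 minutes old right now.
print(old_metric_line(60))
```

Writing the printed line to a file that the file or tail input reads with data_format = "influx" injects the old metric into Telegraf.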

Hr0bar · 2025-01-30T13:17:08Z

Does this work? Did what you said I hope, with oneshot run and for 2 interval cycles in daemon mode as well:

[root@HOSTNAME ~]# telegraf --version
Telegraf 1.34.0-15cedf89 (git: pull/16448@15cedf89)
[root@HOSTNAME ~]# cat /tmp/telegraf.local_server.conf
[[inputs.file]]
  data_format = "influx"
  files = ["/tmp/telegraftest"]
  
[root@HOSTNAME ~]# cat "/tmp/telegraftest"
testplugin,test=test0 usage_active=11.9 1738238709000000000
testplugin,test=test0 usage_active=11.9 1738238709000000001
[root@HOSTNAME ~]# # above is 60 minutes in the past currently
[root@HOSTNAME ~]# cat /etc/telegraf/telegraf.conf

[global_tags]

[agent]
  debug = true
  interval = "1m"
  round_interval = true
  metric_batch_size = 100
  metric_buffer_limit = 1000
  collection_jitter = "0s"
  flush_interval = "1m"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.azure_monitor]]
  fieldexclude = ["storage_used","storage_total","percpu_load_percent"]
  tagexclude = ["storage_descr"]
  timeout = "10s"
  timestamp_limit_past = "120m"

[root@HOSTNAME ~]# telegraf --config /etc/telegraf/telegraf.conf -config /tmp/telegraf.local_server.conf --once
2025-01-30T13:08:39Z I! Loading config: /etc/telegraf/telegraf.conf
2025-01-30T13:08:39Z I! Loading config: /tmp/telegraf.local_server.conf
2025-01-30T13:08:39Z I! Starting Telegraf 1.34.0-15cedf89 brought to you by InfluxData the makers of InfluxDB
2025-01-30T13:08:39Z I! Available plugins: 237 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-01-30T13:08:39Z I! Loaded inputs: file
2025-01-30T13:08:39Z I! Loaded aggregators:
2025-01-30T13:08:39Z I! Loaded processors:
2025-01-30T13:08:39Z I! Loaded secretstores:
2025-01-30T13:08:39Z I! Loaded outputs: azure_monitor
2025-01-30T13:08:39Z I! Tags enabled: host=HOSTNAME.com
2025-01-30T13:08:39Z W! [agent] The default value of 'skip_processors_after_aggregators' will change to 'true' with Telegraf v1.40.0! If you need the current default behavior, please explicitly set the option to 'false'!
2025-01-30T13:08:39Z D! [agent] Initializing plugins
2025-01-30T13:08:39Z D! [agent] Connecting outputs
2025-01-30T13:08:39Z D! [agent] Attempting connection to [outputs.azure_monitor]
2025-01-30T13:08:39Z D! [outputs.azure_monitor] Writing to Azure Monitor URL: https://northeurope.monitoring.azure.com/subscriptions/censored/resourceGroups/censored/providers/Microsoft.Compute/virtualMachines/HOSTNAME/metrics
2025-01-30T13:08:39Z D! [agent] Successfully connected to outputs.azure_monitor
2025-01-30T13:08:39Z D! [agent] Starting service inputs
2025-01-30T13:08:39Z D! [agent] Stopping service inputs
2025-01-30T13:08:39Z D! [agent] Input channel closed
2025-01-30T13:08:39Z I! [agent] Hang on, flushing any cached metrics before shutdown
2025-01-30T13:08:39Z D! [outputs.azure_monitor] Buffer fullness: 0 / 1000 metrics
2025-01-30T13:08:39Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
2025-01-30T13:08:39Z I! [agent] Stopping running outputs
2025-01-30T13:08:39Z D! [agent] Stopped Successfully

[root@HOSTNAME ~]# telegraf --config /etc/telegraf/telegraf.conf -config /tmp/telegraf.local_server.conf
2025-01-30T13:10:20Z I! Loading config: /etc/telegraf/telegraf.conf
2025-01-30T13:10:20Z I! Loading config: /tmp/telegraf.local_server.conf
2025-01-30T13:10:20Z I! Starting Telegraf 1.34.0-15cedf89 brought to you by InfluxData the makers of InfluxDB
2025-01-30T13:10:20Z I! Available plugins: 237 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2025-01-30T13:10:20Z I! Loaded inputs: file
2025-01-30T13:10:20Z I! Loaded aggregators:
2025-01-30T13:10:20Z I! Loaded processors:
2025-01-30T13:10:20Z I! Loaded secretstores:
2025-01-30T13:10:20Z I! Loaded outputs: azure_monitor
2025-01-30T13:10:20Z I! Tags enabled: host=HOSTNAME.com
2025-01-30T13:10:20Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"HOSTNAME.com", Flush Interval:1m0s
2025-01-30T13:10:20Z W! [agent] The default value of 'skip_processors_after_aggregators' will change to 'true' with Telegraf v1.40.0! If you need the current default behavior, please explicitly set the option to 'false'!
2025-01-30T13:10:20Z D! [agent] Initializing plugins
2025-01-30T13:10:20Z D! [agent] Connecting outputs
2025-01-30T13:10:20Z D! [agent] Attempting connection to [outputs.azure_monitor]
2025-01-30T13:10:20Z D! [outputs.azure_monitor] Writing to Azure Monitor URL: https://northeurope.monitoring.azure.com/subscriptions/censored/resourceGroups/censored/providers/Microsoft.Compute/virtualMachines/HOSTNAME/metrics
2025-01-30T13:10:20Z D! [agent] Successfully connected to outputs.azure_monitor
2025-01-30T13:10:20Z D! [agent] Starting service inputs
2025-01-30T13:11:21Z D! [outputs.azure_monitor] Buffer fullness: 0 / 1000 metrics
2025-01-30T13:11:21Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
2025-01-30T13:12:20Z D! [outputs.azure_monitor] Buffer fullness: 0 / 1000 metrics
2025-01-30T13:12:20Z E! [agent] Error writing to outputs.azure_monitor: failed to write batch: [400] 400 Bad Request: {"error":{"code":"BadRequest","message":"'time' should not be older than 30 minutes and not more than 4 minutes in the future\r\n"}}
^C2025-01-30T13:12:30Z D! [agent] Stopping service inputs
2025-01-30T13:12:30Z D! [agent] Input channel closed
2025-01-30T13:12:30Z I! [agent] Hang on, flushing any cached metrics before shutdown
2025-01-30T13:12:30Z D! [outputs.azure_monitor] Buffer fullness: 0 / 1000 metrics
2025-01-30T13:12:30Z I! [agent] Stopping running outputs
2025-01-30T13:12:30Z D! [agent] Stopped Successfully
[root@HOSTNAME ~]#

srebhan · 2025-01-30T13:48:48Z

Seems to work! Thanks!

Hr0bar added the "feature request" label (Requests for new plugin and for new features to existing plugins) on Sep 17, 2024.

srebhan self-assigned this on Oct 7, 2024.

Hr0bar mentioned this issue on Oct 9, 2024: docs(outputs): Clarify buffer limits behavior and fix spec wording #15999 (merged).

srebhan mentioned this issue on Oct 16, 2024: docs(specs): Add specification for partial-write errors #16034 (merged).

srebhan mentioned this issue on Nov 5, 2024: feat(outputs): Implement partial write errors #16146 (merged).

srebhan linked a pull request on Jan 29, 2025 that will close this issue: fix(outputs.azure_monitor): Prevent infinite send loop for outdated metrics #16448 (open).
