node_metrics output plugin producing stale data for no longer existent network devices #9400
Comments
We already cut off outdated metrics for Prometheus remote write; we should do the same in the OTel encoder.
https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
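A minimal sketch of such a timestamp-based cutoff, assuming a helper that compares a sample's last update against a fixed staleness window; none of the names below are part of the fluent-bit or cmetrics APIs:

```c
/* Hypothetical staleness check: drop samples whose last update is older
 * than a fixed cutoff, similar in spirit to the out-of-bounds rejection
 * Mimir performs on ingestion. Not actual fluent-bit/cmetrics code. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define STALE_CUTOFF_NS (5ULL * 60ULL * 1000000000ULL)  /* 5 minutes */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/* Returns true when the sample was last updated before the cutoff window,
 * e.g. a veth device metric that was never refreshed after the device
 * disappeared. The encoder would skip such samples instead of re-sending
 * them with yesterday's timestamp. */
static bool sample_is_stale(uint64_t sample_ts_ns)
{
    uint64_t now = now_ns();
    return sample_ts_ns < now && (now - sample_ts_ns) > STALE_CUTOFF_NS;
}
```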
Please note that metrics keep arriving even from the previous day. I think the problem lies somewhere in network device detection, i.e. in how removal is handled. Only the "netdev" metrics are broken.
This is because our metrics handling system keeps previously registered metrics around, which shows up as stale data: if a metric was registered yesterday and never updated afterwards, it is still exported with yesterday's timestamp when the registry is traversed.
I just noticed the same behaviour with filesystem metrics; the error was logged at "Sep 25, 2024 @ 10:12:32.045".
/run/user/2137 is an ephemeral filesystem created during a session that ceases to exist after logout, yet metrics with (probably) the last timestamp from when it was mounted are still propagated. I think it shouldn't be reported anymore.
For payloads sent as OTel, this likely has the same root cause: the OTel encoder in cmetrics, one of the fundamental libraries behind fluent-bit's observability features.
Is there anything I can do to help resolve this?
This is being handled in the underlying library here: I added some comments; maybe @ElectricWeasel you can jump in. While the suggested solution cuts off metrics based on the elapsed timestamp, I am not 100% sure this is the desired behavior for everybody. If fluent/cmetrics#223 moves forward, I think it should not introduce a breaking change, but instead expose a configuration option that we can manage from out_opentelemetry for users who need granular control over this.
Ah, yes. We need to provide an option to choose whether to use the cutoff or not. I'll implement it.
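If that option lands, usage could look roughly like the following classic-mode configuration; the option name, value, and default behavior below are purely hypothetical and only illustrate the kind of knob being discussed, not an existing out_opentelemetry setting:

```
[OUTPUT]
    Name   opentelemetry
    Match  *
    Host   my-otel-collector
    Port   4318
    # Hypothetical option: drop metric samples whose last update is older
    # than this window instead of re-encoding them (0 = keep everything).
    metrics_staleness_cutoff  5m
```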
Bug Report
The node_metrics output plugin produces stale data for network devices that no longer exist; this can be observed in Mimir logs and in a file output dump.
It is somehow triggered by the veth* virtual network devices created for Docker containers: metrics are repeatedly sent long after the device is gone. We are using Docker nodes in Swarm mode to run application builds and tests (Jenkins agents), so the containers are short-lived instances.
To Reproduce
Expected behavior
No stale metrics delivered
Environment