node_metrics output plugin producing stale data for network devices that no longer exist #9400

Open
ElectricWeasel opened this issue Sep 18, 2024 · 9 comments

Comments


ElectricWeasel commented Sep 18, 2024

Bug Report

The node_metrics output plugin produces stale data for network devices that no longer exist; this can be observed in the Mimir logs and in the file output dump.
It is somehow triggered by the veth* virtual network devices created for Docker containers: their metrics keep being sent long after the device is gone. We are using Docker nodes in Swarm mode to run application builds and tests (Jenkins agents), so the containers are short-lived.

To Reproduce

  • create a Docker container attached to a network
  • metrics dumped by the file output (correct date on the host is 2024-09-18T06:23:54):
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="lo"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="enp2s0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="wlp3s0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="docker0"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="docker_gwbridge"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="veth9131746"} = 0
2024-09-18T06:23:54.163003369Z node_network_transmit_compressed_total{device="vethceee685"} = 0
2024-09-17T13:02:54.163178349Z node_network_transmit_compressed_total{device="veth4f24bf3"} = 0
2024-09-17T13:02:54.163178349Z node_network_transmit_compressed_total{device="veth8d92a60"} = 0
2024-09-17T13:04:09.163346349Z node_network_transmit_compressed_total{device="vethb672c5c"} = 0
2024-09-17T13:20:24.163096825Z node_network_transmit_compressed_total{device="veth204c129"} = 0
2024-09-17T13:19:24.320445397Z node_network_transmit_compressed_total{device="veth80e9d0a"} = 0
2024-09-17T13:29:54.162821225Z node_network_transmit_compressed_total{device="veth3dc421c"} = 0
2024-09-17T13:38:09.162996512Z node_network_transmit_compressed_total{device="veth7c25eb8"} = 0
2024-09-17T13:38:09.162996512Z node_network_transmit_compressed_total{device="veth6597216"} = 0
2024-09-17T13:50:24.163177111Z node_network_transmit_compressed_total{device="veth097250a"} = 0
2024-09-17T13:53:54.162955756Z node_network_transmit_compressed_total{device="vethe738f49"} = 0
2024-09-17T13:56:24.162887438Z node_network_transmit_compressed_total{device="vethc13dfc2"} = 0
2024-09-17T13:58:24.162943862Z node_network_transmit_compressed_total{device="vethdb04c37"} = 0
2024-09-17T13:58:39.163101877Z node_network_transmit_compressed_total{device="veth49217c9"} = 0
2024-09-17T14:00:54.163102836Z node_network_transmit_compressed_total{device="vethf93b1c7"} = 0
2024-09-17T14:56:09.163110435Z node_network_transmit_compressed_total{device="veth3f0323e"} = 0
2024-09-17T15:41:39.163064514Z node_network_transmit_compressed_total{device="vetha81d561"} = 0
2024-09-17T15:54:09.163025023Z node_network_transmit_compressed_total{device="vethe85281d"} = 0
2024-09-18T06:23:54.162837303Z node_memory_MemTotal_bytes = 16392421376
  • example Mimir logs:
failed pushing to ingester opentelemetry-mimir-3: user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-17T13:04:09.163Z and is from series node_network_transmit_errs_total{device="vethb672c5c", host_name="xxxx.xxx.xxxx", metrics_agent="fluent-bit", metrics_source="host-metrics"}
  • Steps to reproduce the problem: see the minimal configuration sketch below.
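A minimal configuration that shows the stale series in the file output (an untested trim of the full configuration below, with the collectors reduced to netdev only):

[SERVICE]
    flush 1

[INPUT]
    Name                 node_exporter_metrics
    Tag                  node_metrics
    metrics              netdev
    Scrape_interval      15

[OUTPUT]
    Name  file
    Match node_metrics
    Path  /var/log
    File  metrics.log

Create a container that gets a veth device, remove it, and watch whether the removed device's series keeps appearing in /var/log/metrics.log with its last timestamp.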

Expected behavior
No stale metrics delivered

Environment

  • Version used: fluent-bit-3.1.7-1.x86_64
  • Configuration:
[SERVICE]
    # Flush
    # =====
    # set an interval of seconds before to flush records to a destination
    flush 1

    # Daemon
    # ======
    # instruct Fluent Bit to run in foreground or background mode.
    daemon Off

    # Log_Level
    # =========
    # Set the verbosity level of the service, values can be:
    #
    # - error
    # - warning
    # - info
    # - debug
    # - trace
    #
    # by default 'info' is set, that means it includes 'error' and 'warning'.
    log_level debug

    # Parsers File
    # ============
    # specify an optional 'Parsers' configuration file
    parsers_file parsers.conf
    parsers_file parsers-custom.conf

    # Plugins File
    # ============
    # specify an optional 'Plugins' configuration file to load external plugins.
    plugins_file plugins.conf

    # HTTP Server
    # ===========
    # Enable/Disable the built-in HTTP Server for metrics
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    # Storage
    # =======
    # Fluent Bit can use memory and filesystem buffering based mechanisms
    #
    # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
    #
    # storage metrics
    # ---------------
    # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
    # exported only if the 'http_server' option is enabled.
    storage.metrics on

    # storage.path
    # ------------
    # absolute file system path to store filesystem data buffers (chunks).
    #
    storage.path /var/lib/fluent-bit/storage

    # storage.sync
    # ------------
    # configure the synchronization mode used to store the data into the
    # filesystem. It can take the values normal or full.
    #
    storage.sync normal

    # storage.checksum
    # ----------------
    # enable the data integrity check when writing and reading data from the
    # filesystem. The storage layer uses the CRC32 algorithm.
    #
    # storage.checksum off

    # storage.backlog.mem_limit
    # -------------------------
    # if storage.path is set, Fluent Bit will look for data chunks that were
    # not delivered and are still in the storage layer, these are called
    # backlog data. This option configure a hint of maximum value of memory
    # to use when processing these records.
    #
    # storage.backlog.mem_limit 5M
    storage.total_limit_size 512M
    storage.max_chunks_up 128

# Systemd services logs (docker)
[INPUT]
    Name systemd
    Tag systemd.*
    Systemd_Filter _SYSTEMD_UNIT=docker.service
    Lowercase on
    Strip_Underscores on
    DB /var/lib/fluent-bit/cursors/systemd.sqlite
    storage.type filesystem

[INPUT]
    Name                 node_exporter_metrics
    Tag                  node_metrics
    metrics "cpu,meminfo,diskstats,filesystem,uname,stat,time,loadavg,vmstat,netdev,filefd"
    Scrape_interval      15

# Forward/fluentd input for docker services logging
[INPUT]
    Name forward
    Unix_Path /run/fluentd-forward.sock
    Unix_Perm 0666
    storage.type filesystem

[OUTPUT]
    Match systemd.*
    Name opensearch
    Host xxxxx.xxx.xxxxxx
    Port 443
    HTTP_User fluentbit
    HTTP_Passwd xxxxxxxx
    Index systemd
    Suppress_Type_Name On
    Tls On

[OUTPUT]
    Name opentelemetry
    Match node_metrics
    Host xxx.xxx.xxx
    Port 443
    Log_response_payload False
    Tls                  On
    logs_body_key $message
    logs_span_id_message_key span_id
    logs_trace_id_message_key trace_id
    logs_severity_text_message_key loglevel
    logs_severity_number_message_key lognum
    # add user-defined labels
    add_label metrics_agent fluent-bit
    add_label metrics_source host-metrics
    add_label host_name xxxx.xxx.xxx

[OUTPUT]
    Name file
    Match node_metrics
    Path /var/log
    File metrics.log
  • Environment name and version: Docker CE docker-ce-25.0.3-1.el9.x86_64
  • Server type and version: Dell Inspiron 5577
  • Operating System and version: AlmaLinux 9.3
  • Filters and plugins:
    • input node_exporter_metrics
    • output opentelemetry
    • output file (for debug)
@cosmo0920
Contributor

We already cut off outdated metrics for the Prometheus remote write output. We should do the same in the OTel encoder.

@cosmo0920
Contributor

https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
Mimir is quite strict when validating outdated metrics; it only permits samples within 5 minutes of each other in the same batch.

@ElectricWeasel
Author

> https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
> Mimir is quite strict when validating outdated metrics; it only permits samples within 5 minutes of each other in the same batch.

Please note that metrics keep coming even from the previous day. I think the problem lies somewhere in the network device detection, i.e. around when a device is removed. Only the "netdev" metrics are affected.

@cosmo0920
Contributor

> https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
> Mimir is quite strict when validating outdated metrics; it only permits samples within 5 minutes of each other in the same batch.

> Please note that metrics keep coming even from the previous day. I think the problem lies somewhere in the network device detection, i.e. around when a device is removed. Only the "netdev" metrics are affected.

This is because our metrics handling system keeps the previously registered metrics in its context. If a series was registered yesterday and has not been updated since, it is still emitted with yesterday's timestamp when the context is traversed, which is why it shows up as a too-old sample.
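For example, a minimal sketch of that behaviour against the public cmetrics C API (untested; header and type names may differ between versions, and this is not the actual Fluent Bit code path):

/* A counter that is set once and never updated again keeps its original
 * timestamp every time the context is encoded. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <cmetrics/cmetrics.h>
#include <cmetrics/cmt_counter.h>
#include <cmetrics/cmt_encode_text.h>

int main(void)
{
    struct cmt *cmt = cmt_create();
    struct cmt_counter *c;
    struct timespec tm;
    uint64_t ts;
    char *text;

    c = cmt_counter_create(cmt, "node", "network", "transmit_compressed_total",
                           "Compressed packets transmitted",
                           1, (char *[]) {"device"});

    /* "yesterday": the veth device exists and the sample is registered */
    clock_gettime(CLOCK_REALTIME, &tm);
    ts = (uint64_t) tm.tv_sec * 1000000000ULL + (uint64_t) tm.tv_nsec;
    cmt_counter_set(c, ts, 0, 1, (char *[]) {"veth9131746"});

    /* "today": the device is gone and the sample was never updated, yet
     * every encode of the same context still emits it with the old timestamp */
    text = cmt_encode_text_create(cmt);
    printf("%s", text);
    cmt_encode_text_destroy(text);

    cmt_destroy(cmt);
    return 0;
}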

@ElectricWeasel
Author

I just noticed the same behaviour with filesystem metrics; the error was logged at "Sep 25, 2024 @ 10:12:32.045":

user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:50:15.946Z and is from series node_filesystem_device_error{device="tmpfs", fstype="tmpfs", host_name="petra6.xxx.xxx", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"}

/run/user/2137 is an ephemeral filesystem created for a login session and ceases to exist after logout, yet its metrics are still propagated, probably with the last timestamp from when it was mounted.

I think it shouldn't be reported anymore, should it?

@cosmo0920
Contributor

When sending as OTel payloads, this likely has the same root cause. It would come from the OTel encoder in cmetrics, one of the fundamental libraries behind Fluent Bit's observability features.

@ElectricWeasel
Author

Is there anything I can do to help resolve this?

@edsiper
Member

edsiper commented Sep 26, 2024

This is being handled in the underlying library here:

fluent/cmetrics#223

I added some comments; maybe @ElectricWeasel you can jump in. While the suggested solution cuts off metrics based on the elapsed timestamp, I am not 100% sure this is the desired behavior for everybody.
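For clarity, the cutoff amounts to dropping, at encode time, any sample whose timestamp is further in the past than a configurable interval; roughly, as a hypothetical sketch and not the actual fluent/cmetrics#223 code:

#include <stdbool.h>
#include <stdint.h>

/* Return true when a sample should be dropped because it is older than
 * cutoff_ns relative to the time the payload is being encoded. */
static bool sample_is_stale(uint64_t sample_ts_ns, uint64_t now_ns,
                            uint64_t cutoff_ns)
{
    return now_ns > sample_ts_ns && (now_ns - sample_ts_ns) > cutoff_ns;
}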

If fluent/cmetrics#223 moves forward, I think it should not introduce a breaking change; instead, it should expose a configuration option that we can manage from out_opentelemetry for users who need granular control over this.

@cosmo0920
Copy link
Contributor

Ah, yes. We need to provide an option to choose whether to use the cutoff or not. I'll implement it.
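From the output side it could look something like this (the option names below are purely hypothetical, for illustration only; the real names will be decided together with fluent/cmetrics#223):

[OUTPUT]
    Name  opentelemetry
    Match node_metrics
    Host  xxx.xxx.xxx
    Port  443
    Tls   On
    # hypothetical option names, for illustration only
    metrics_cutoff           on
    metrics_cutoff_interval  5m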
