node_metrics output plugin producing stale data for no longer existent network devices #9400
Comments
We already cut off outdated metrics for Prometheus remote write; we should do the same in the OTel encoder.
https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
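A minimal sketch of such a timestamp-based cutoff, assuming a helper that compares a sample's last update against a fixed staleness window; none of the names below are part of the fluent-bit or cmetrics APIs:

```c
/* Hypothetical staleness check: drop samples whose last update is older
 * than a fixed cutoff, similar in spirit to the out-of-bounds rejection
 * Mimir performs on ingestion. Not actual fluent-bit/cmetrics code. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define STALE_CUTOFF_NS (5ULL * 60ULL * 1000000000ULL)  /* 5 minutes */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/* Returns true when the sample was last updated before the cutoff window,
 * e.g. a veth device metric that was never refreshed after the device
 * disappeared. The encoder would skip such samples instead of re-sending
 * them with yesterday's timestamp. */
static bool sample_is_stale(uint64_t sample_ts_ns)
{
    uint64_t now = now_ns();
    return sample_ts_ns < now && (now - sample_ts_ns) > STALE_CUTOFF_NS;
}
```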
Please note that metrics keep arriving even from the previous day. I think the problem lies somewhere in network device detection, i.e. in how removal is handled. Only the "netdev" metrics are broken.
This is because our metrics handling system keeps previously registered metrics around, which shows up as stale data: if a metric was registered yesterday and never updated afterwards, it is still exported with yesterday's timestamp when the registry is traversed.
I just noticed the same behaviour with filesystem metrics; the error was logged at "Sep 25, 2024 @ 10:12:32.045".
/run/user/2137 is an ephemeral filesystem created during a session that ceases to exist after logout, yet metrics with (probably) the last timestamp from when it was mounted are still propagated. I think it shouldn't be reported anymore.
For payloads sent as OTel, this likely has the same root cause: the OTel encoder in cmetrics, one of the fundamental libraries behind fluent-bit's observability features.
Is there anything I can do to help resolve this?
This is being handled in the underlying library here: I added some comments; maybe @ElectricWeasel you can jump in. While the suggested solution cuts off metrics based on the elapsed timestamp, I am not 100% sure this is the desired behavior for everybody. If fluent/cmetrics#223 moves forward, I think it should not introduce a breaking change, but instead expose a configuration option that we can manage from out_opentelemetry for users who need granular control over this.
Ah, yes. We need to provide an option to choose whether to use the cutoff or not. I'll implement it.
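If that option lands, usage could look roughly like the following classic-mode configuration; the option name, value, and default behavior below are purely hypothetical and only illustrate the kind of knob being discussed, not an existing out_opentelemetry setting:

```
[OUTPUT]
    Name   opentelemetry
    Match  *
    Host   my-otel-collector
    Port   4318
    # Hypothetical option: drop metric samples whose last update is older
    # than this window instead of re-encoding them (0 = keep everything).
    metrics_staleness_cutoff  5m
```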
Bug Report
The node_metrics output plugin produces stale data for network devices that no longer exist; this can be observed in Mimir logs and in a file output dump.
It is somehow triggered by the veth* virtual network devices created for Docker containers: metrics are repeatedly sent long after the device is gone. We are using Docker nodes in Swarm mode to run application builds and tests (Jenkins agents), so the containers are short-lived instances.
To Reproduce
Expected behavior
No stale metrics delivered
Environment