Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

amolnater-qasource · 2024-08-21T09:50:21Z

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 77469
COMMIT: 1f1a3594615afceda26308abf163497114b5858f

Preconditions:

8.16.0 SNAPSHOT Kibana cloud environment should be available.
Remote Elasticsearch Output should be added.
Agent should be installed with policy having System integration.

Steps to reproduce:

Navigate to Agent policies tab.
Now update output for system integration and set Remote Elasticsearch.
Observe agent gets unhealthy.
However observe data on the remote ES cluster is available.

NOTE:

No impact on data is observed.
Issue is consistently reproducible on two different agents.

Expected Result:
Agent should remain healthy on switching integration output to Remote ES.

Logs:
elastic-agent-diagnostics-2024-08-21T09-40-35Z-00.zip

Screen Recording:

ip-172-31-18-23.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-09-58.mp4

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-11-25.mp4

Feature:
elastic/kibana#143905

The text was updated successfully, but these errors were encountered:

elasticmachine · 2024-08-21T09:50:23Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

amolnater-qasource · 2024-08-21T09:50:34Z

@karanbirsingh-qasource Please review.

ghost · 2024-08-21T10:39:33Z

secondary review of this ticket is done

pierrehilbert · 2024-08-28T08:52:17Z

Are we still unhealthy after restarting the Agent?
Are we having the same behavior when we are directly setting up the Remote Elasticsearch instead of switching to it?

amolnater-qasource · 2024-08-28T09:16:19Z

Hi @pierrehilbert
Thank you for looking into this.

Are we still unhealthy after restarting the Agent?

No, the agent gets back healthy after sometime.
On restart too, agent immediately gets back Healthy.

Are we having the same behavior when we are directly setting up the Remote Elasticsearch instead of switching to it?

We have revalidated by setting the Remote Elasticsearch output first and then installing the agent.
We have found it working fine and observed agent remains Healthy throughout.

Agent Logs:
elastic-agent-diagnostics-2024-08-28T09-15-50Z-00.zip

Please let us know if we anything else is required from our end.
Thanks!!

cmacknz · 2024-09-03T17:34:57Z

Looking at logs this doesn't have anything to do with remote ES and is caused by the monitoring endpoints of the Beats not being available fast enough:

logs/elastic-agent-8.16.0-SNAPSHOT-063287/elastic-agent-20240821.ndjson
3278:{"log.level":"warn","@timestamp":"2024-08-21T09:36:49.086Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
3279:{"log.level":"warn","@timestamp":"2024-08-21T09:36:49.086Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/DaWi_h1Ho3tUlAynCECuHtgXQYuKcWEJ.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3338:{"log.level":"info","@timestamp":"2024-08-21T09:36:49.651Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (DEGRADED->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3352:{"log.level":"info","@timestamp":"2024-08-21T09:36:49.960Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (DEGRADED->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3366:{"log.level":"warn","@timestamp":"2024-08-21T09:36:50.657Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3367:{"log.level":"warn","@timestamp":"2024-08-21T09:36:50.671Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://unix/inputs\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3405:{"log.level":"info","@timestamp":"2024-08-21T09:36:59.352Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3406:{"log.level":"info","@timestamp":"2024-08-21T09:36:59.410Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metri

ycombinator · 2024-10-19T00:34:32Z

is caused by the monitoring endpoints of the Beats not being available fast enough:

@cmacknz is there any work to be done here or is this normal, expected behavior?

cmacknz · 2024-10-21T15:44:50Z

This behavior is currently causing flakiness in our integration tests so I think we should address is.

This is the first example I've seen outside of our tests, which means any fix can't be in our test infrastructure it has to go in agent itself.

I think the rate at which we poll for this endpoint to exist is the new 60s metrics interval, so you can have to poll the agent health for 1+ minute to see if it recovers.

A fix might look like retrying the connection of this socket faster in the agent. It is also possible beat startup has gotten slower.

ycombinator · 2024-10-21T16:53:58Z

Added it to the current sprint.

pchila · 2024-11-06T14:33:15Z

Had a quick look around the code of elastic-agent and beats to try and sum up the issue:

@cmacknz is right in saying that the degraded state is coming from the http endpoints not being available because of a restart due to output change, the involved snippets of code are:

elastic-agent

elastic-agent/internal/pkg/agent/application/monitoring/v1_monitor.go

Lines 647 to 675 in 8c43ad3

    
           componentListWithMonitoring := map[string]string{ 
        
           	fmt.Sprintf("beat/%s", monitoringMetricsUnitID): "metricbeat", 
        
           	fmt.Sprintf("http/%s", monitoringMetricsUnitID): "metricbeat", 
        
           	monitoringFilesUnitsID:                          "filebeat", 
        
           } 
        
           for k, v := range componentIDToBinary { 
        
           	componentListWithMonitoring[k] = v 
        
           } 
        
           for unit, binaryName := range componentListWithMonitoring { 
        
           	if !isSupportedMetricsBinary(binaryName) { 
        
           		continue 
        
           	} 
        
           	endpoints := []interface{}{prefixedEndpoint(utils.SocketURLWithFallback(unit, paths.TempDir()))} 
        
           	name := strings.ReplaceAll(strings.ReplaceAll(binaryName, "-", "_"), "/", "_") // conform with index naming policy 
        
           	if isSupportedBeatsBinary(binaryName) { 
        
           		beatsStreams = append(beatsStreams, map[string]interface{}{ 
        
           			idKey: fmt.Sprintf("%s-", monitoringMetricsUnitID) + name, 
        
           			"data_stream": map[string]interface{}{ 
        
           				"type":      "metrics", 
        
           				"dataset":   fmt.Sprintf("elastic_agent.%s", name), 
        
           				"namespace": monitoringNamespace, 
        
           			}, 
        
           			"metricsets": []interface{}{"stats"}, 
        
           			"hosts":      endpoints, 
        
           			"period":     metricsCollectionIntervalString, 
        
           			"index":      fmt.Sprintf("metrics-elastic_agent.%s-%s", name, monitoringNamespace),

beats https://github.com/elastic/beats/blob/1f033c9eed2ccd5b5acca830b47ec3c2261ef026/metricbeat/mb/module/wrapper.go#L254-L269
So metricbeat is setting itself as degraded and will reset to healthy only on the next metrics fetch (which will happen by default after 60s)

Changing the interval between metrics is already possible via configuration in the monitoring config

elastic-agent/internal/pkg/agent/application/monitoring/v1_monitor.go

Lines 149 to 153 in 8c43ad3

    
           if metricsPeriod, found := monitoringMap[monitoringMetricsPeriodKey]; found { 
        
           	if metricsPeriodStr, ok := metricsPeriod.(string); ok { 
        
           		metricsCollectionIntervalString = metricsPeriodStr 
        
           	} 
        
           }

At this point I see a few options:

Reduce the default interval for metrics collection: we will lose some or all the savings implemented with Increase metrics interval to 60s #3578
if the issue is only in tests: a simple monitoring config is the best bet as it doesn't impact anyone else and doesn't require code changes in production code
make the state change based on metricset fetch outcome configurable and disable it for the monitoring inputs so that agent will report the monitoring beats degraded only when those are missing check-ins
Implement a retry mechanisms in metricbeat for failed metricsets:
1. Retry n times within a reasonable span of time (every <configurable number> seconds up to some deadline < normal metric interval)
2. Allow metricsets fetch to reschedule themselves with a shorter interval in case of error: this will have an impact on the regular sampling that is used right now in metricbeat with some consequences in case of aggregations that assume regular spacing between metricsets

@ycombinator @pierrehilbert @jlind23 @cmacknz what do you think would be an acceptable solution for this ?
Can we sacrifice some of the ES space saved by slowing the metricsets ingestion in ES in favor of better responsiveness ?
Is it ok to introduce some variance in the cadence of metricsets ingestion in case of errors ?

pchila · 2024-11-06T15:04:22Z

make the state change based on metricset fetch outcome configurable and disable it for the monitoring inputs so that agent will report the monitoring beats degraded only when those are missing check-ins

A very valid suggestion from @pkoutsovasilis that can be added to the list of options:
we could also configure the number of failures before metricbeat sets itself as DEGRADED (for example 2-3 failures in a row for our monitoring inputs) so that we can avoid showing DEGRADED status for transient errors

cmacknz · 2024-11-06T20:07:45Z

if the issue is only in tests:

This bug report is an instance of this happening outside of our tests, our tests are just built to detect this easily, and I think the problem is ultimately transient.

Can we sacrifice some of the ES space saved by slowing the metricsets ingestion in ES in favor of better responsiveness ?

No, the space savings are cost savings in the end.

we could also configure the number of failures before metricbeat sets itself as DEGRADED (for example 2-3 failures in a row for our monitoring inputs) so that we can avoid showing DEGRADED status for transient errors

I like this, seems like simple and works as long as this problem is self-correcting.

blakerouse · 2024-11-06T20:52:05Z

Or another correct solution would be to stop restarting the beats when the output changes ;-)

amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:medium labels Aug 21, 2024

ycombinator assigned pchila Oct 21, 2024

cmacknz changed the title ~~Agent gets unhealthy temporarily on switching integration output to Remote ES.~~ Agent gets unhealthy temporarily because Beat monitoring sockets are not available Oct 21, 2024

pchila mentioned this issue Nov 8, 2024

Metricbeat: add configurable failure threshold before reporting streams as degraded elastic/beats#41570

Draft

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

amolnater-qasource commented Aug 21, 2024

elasticmachine commented Aug 21, 2024

amolnater-qasource commented Aug 21, 2024

ghost commented Aug 21, 2024

pierrehilbert commented Aug 28, 2024

amolnater-qasource commented Aug 28, 2024

cmacknz commented Sep 3, 2024

ycombinator commented Oct 19, 2024

cmacknz commented Oct 21, 2024

ycombinator commented Oct 21, 2024

pchila commented Nov 6, 2024

pchila commented Nov 6, 2024 •

edited

Loading

cmacknz commented Nov 6, 2024

blakerouse commented Nov 6, 2024

Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

Comments

amolnater-qasource commented Aug 21, 2024

elasticmachine commented Aug 21, 2024

amolnater-qasource commented Aug 21, 2024

ghost commented Aug 21, 2024

pierrehilbert commented Aug 28, 2024

amolnater-qasource commented Aug 28, 2024

cmacknz commented Sep 3, 2024

ycombinator commented Oct 19, 2024

cmacknz commented Oct 21, 2024

ycombinator commented Oct 21, 2024

pchila commented Nov 6, 2024

pchila commented Nov 6, 2024 • edited Loading

cmacknz commented Nov 6, 2024

blakerouse commented Nov 6, 2024

pchila commented Nov 6, 2024 •

edited

Loading