Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@karanbirsingh-qasource Please review.
Secondary review of this ticket is done.
Are we still unhealthy after restarting the Agent?
Hi @pierrehilbert,
We have revalidated by setting the Remote Elasticsearch output first and then installing the agent. Agent logs: Please let us know if anything else is required from our end.
Looking at the logs, this doesn't have anything to do with remote ES; it is caused by the monitoring endpoints of the Beats not being available fast enough:
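For anyone trying to observe this locally, here is a minimal sketch (not the agent's actual code) of probing a Beat's monitoring endpoint over a unix socket. The socket path is a hypothetical placeholder, and it assumes the Beat exposes its HTTP monitoring endpoint (e.g. `/stats`) on that socket; until the Beat has opened the socket, the request fails, which is what surfaces as a temporarily unhealthy component.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Hypothetical path; the agent derives the real per-component socket path itself.
	socketPath := "/tmp/example-metricbeat-monitoring.sock"

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Route all requests through the unix domain socket instead of TCP.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
			},
		},
	}

	// The host in the URL is ignored; only the path matters once we dial the socket.
	resp, err := client.Get("http://unix/stats")
	if err != nil {
		// The Beat has not opened its monitoring socket yet.
		fmt.Println("monitoring endpoint not available yet:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d bytes=%d\n", resp.StatusCode, len(body))
}
```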
@cmacknz is there any work to be done here, or is this normal, expected behavior?
This behavior is currently causing flakiness in our integration tests, so I think we should address it. This is the first example I've seen outside of our tests, which means any fix can't live in our test infrastructure; it has to go in the agent itself. I think the rate at which we poll for this endpoint to exist is now the 60s metrics interval, so you may have to poll the agent health for 1+ minute to see it recover. A fix might look like having the agent retry the connection to this socket faster. It is also possible that Beat startup has gotten slower.
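A minimal sketch of what that faster-retry idea could look like, assuming a hypothetical helper inside the agent; the backoff values are illustrative and not the agent's real settings:

```go
package monitoring

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForMonitoringSocket probes a Beat's monitoring socket on a short
// exponential backoff until it accepts connections or ctx expires, instead
// of only re-checking on the 60s metrics collection interval.
func waitForMonitoringSocket(ctx context.Context, socketPath string) error {
	backoff := 250 * time.Millisecond
	const maxBackoff = 5 * time.Second

	for {
		conn, err := net.DialTimeout("unix", socketPath, time.Second)
		if err == nil {
			conn.Close()
			return nil // socket is up; metrics collection can proceed
		}

		select {
		case <-ctx.Done():
			// Deadline hit: surface the last dial error so the caller can
			// mark the component degraded rather than silently waiting.
			return fmt.Errorf("monitoring socket %s not ready: %v (%w)", socketPath, err, ctx.Err())
		case <-time.After(backoff):
		}

		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```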
Added it to the current sprint.
Had a quick look around the code of elastic-agent and beats to try and sum up the issue:
At this point I see a few options:
@ycombinator @pierrehilbert @jlind23 @cmacknz what do you think would be an acceptable solution for this?
A very valid suggestion from @pkoutsovasilis that can be added to the list of options:
This bug report is an instance of this happening outside of our tests; our tests are just built to detect it easily, and I think the problem is ultimately transient.
No, the space savings are cost savings in the end.
I like this; it seems simple and works as long as this problem is self-correcting.
Or another correct solution would be to stop restarting the Beats when the output changes ;-)
Kibana Build details:
Preconditions:
Steps to reproduce:
NOTE:
Expected Result:
Agent should remain healthy on switching integration output to Remote ES.
Logs:
elastic-agent-diagnostics-2024-08-21T09-40-35Z-00.zip
Screen Recording:
ip-172-31-18-23.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-09-58.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-11-25.mp4
Feature:
elastic/kibana#143905