[Fleet]: On enrolling RPM and Deb agents, Restarting agent failed error is displayed in CLI. #4084

Open
harshitgupta-qasource opened this issue Jan 15, 2024 · 8 comments · May be fixed by #6494
Labels
bug Something isn't working impact:medium Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

harshitgupta-qasource commented Jan 15, 2024

Kibana Build details:

VERSION: 8.12.0 BC6
BUILD: 70088
COMMIT: e9092c0a17923f4ed984456b8a5db619b0a794b3
Artifact Link: https://staging.elastic.co/8.12.0-3eba7f46/summary-8.12.0.html#elastic-agent

Host OS and Browser version: All, All

Preconditions:

  1. 8.12.0 Kibana Cloud environment should be available.
  2. Policy should be created.
  3. Deb/RPM agent should be extracted.

Steps to reproduce:

  1. Run the agent enroll command for an RPM/DEB agent.
  2. Observe that, on enrolling RPM and DEB agents, a Restarting agent failed error is displayed in the CLI.

What's working fine:

  • The agents enroll successfully once the enable and start elastic-agent commands are run.

Expected:
On enrolling RPM and DEB agents, the restarting agent error should not be displayed in the CLI.

Screenshot:
image (1)

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team impact:medium labels Jan 15, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@amolnater-qasource

Secondary Review for this ticket is Done.

@cmacknz
Member

cmacknz commented Jan 15, 2024

I can reproduce this. I suspect it is another unintended consequence of #3815, where we now always treat a failure to restart via the control socket as a fatal error.

The agent service isn't automatically started after running dpkg -i, so the enroll command's attempt to restart it cannot succeed. The fix will likely be similar to #4042: we need to skip the attempt to restart the agent in this case, because starting it is supposed to be manual.

ubuntu@valuable-gudgeon:~$ sudo dpkg -i ./elastic-agent-8.12.0-arm64.deb
Selecting previously unselected package elastic-agent.
(Reading database ... 66270 files and directories currently installed.)
Preparing to unpack .../elastic-agent-8.12.0-arm64.deb ...
Unpacking elastic-agent (8.12.0) ...
Setting up elastic-agent (8.12.0) ...
found symlink /usr/share/elastic-agent/bin/elastic-agent, unlink
create symlink /usr/share/elastic-agent/bin/elastic-agent to /var/lib/elastic-agent/data/elastic-agent-5cbf2e/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo systemctl status elastic-agent
○ elastic-agent.service - Agent manages other beats based on configuration provided.
     Loaded: loaded (/lib/systemd/system/elastic-agent.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: https://www.elastic.co/beats/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo elastic-agent enroll --url=https://2d8b862d544f4fbca4ff375dfae3b19f.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=Qmtvei1vd0JvRFNMYWwxdC04bTU6R3lldEtHc01SYW1iQy1pYU9qOFRsZw==
This will replace your current settings. Do you want to continue? [Y/n]:y
{"log.level":"info","@timestamp":"2024-01-15T11:35:48.449-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":496},"message":"Starting enrollment to URL: https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-15T11:35:49.770-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":461},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-15T11:35:49.771-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Restart attempt 0 failed: 'rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/lib/elastic-agent/data/tmp/elastic-agent-control.sock: connect: no such file or directory\"'. Waiting for 2s","ecs.version":"1.6.0"}

The instructions for enrolling a DEB in Fleet already include manually starting the service for this reason:

curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.0-arm64.deb
sudo dpkg -i elastic-agent-8.12.0-arm64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent 
sudo systemctl start elastic-agent

@cmacknz
Member

cmacknz commented Jan 15, 2024

I should note that the error here doesn't mean the enrollment failed. Enrollment actually succeeded, and if you ignore the error and continue with the following commands, the agent successfully connects to Fleet.

sudo systemctl enable elastic-agent 
sudo systemctl start elastic-agent

@cmacknz
Member

cmacknz commented Jan 17, 2024

We should just need to pass the --skip-daemon-reload flag to the enroll command run by the DEB and RPM packages:

cmd.Flags().Bool("skip-daemon-reload", false, "Skip daemon reload after enrolling")
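
As a rough sketch of what passing that flag looks like from the command line, the helper below appends --skip-daemon-reload to the enroll invocation. The function name enroll_skip_reload and the ELASTIC_AGENT_BIN override are hypothetical and exist only so the sketch can be tried without the agent installed; the URL and token are placeholders.

```shell
#!/bin/sh
# ELASTIC_AGENT_BIN is a hypothetical override for illustration only
# (e.g. ELASTIC_AGENT_BIN=echo prints the arguments instead of enrolling).
ELASTIC_AGENT_BIN="${ELASTIC_AGENT_BIN:-elastic-agent}"

# Hypothetical helper: enroll without asking the (not yet running) daemon
# to restart. $1 is the Fleet URL, $2 is the enrollment token.
enroll_skip_reload() {
    "$ELASTIC_AGENT_BIN" enroll \
        --url="$1" \
        --enrollment-token="$2" \
        --skip-daemon-reload
}
```

Run as root, this would be called as enroll_skip_reload "https://XXXXX:443" "XXXXX", followed by the usual systemctl enable and systemctl start commands.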

@cmacknz
Member

cmacknz commented Feb 23, 2024

You can also avoid the error by starting the agent service before enrolling.

sudo systemctl enable elastic-agent 
sudo systemctl start elastic-agent
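
A minimal sketch of that ordering, assuming the control socket path shown in the error log earlier in this issue is where a started agent creates its socket. The helper name wait_for_control_socket and the CONTROL_SOCKET variable are hypothetical; the idea is only to avoid enrolling before the daemon is ready to accept the restart request.

```shell
#!/bin/sh
# Socket path copied from the "no such file or directory" error in the log;
# treating it as the path a running agent uses is an assumption.
CONTROL_SOCKET="${CONTROL_SOCKET:-/var/lib/elastic-agent/data/tmp/elastic-agent-control.sock}"

# Hypothetical helper: poll once per second until a Unix socket exists at $1,
# giving up after $2 attempts. Returns 0 if the socket appeared, 1 otherwise.
wait_for_control_socket() {
    socket=$1
    attempts=$2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ -S "$socket" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Intended use (as root), not executed here:
#   systemctl enable elastic-agent
#   systemctl start elastic-agent
#   wait_for_control_socket "$CONTROL_SOCKET" 30 &&
#       elastic-agent enroll --url=https://XXXXX:443 --enrollment-token=XXXXX
```

The socket existing does not strictly prove the daemon is healthy, so this is a readiness heuristic rather than a guarantee.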

@cmacknz
Member

cmacknz commented Feb 23, 2024

An alternative to fixing this in the agent is to change the instructions in Fleet to start the service before enrolling:

This is what we have today:

curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.2-amd64.deb
sudo dpkg -i elastic-agent-8.12.2-amd64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent 
sudo systemctl start elastic-agent

We are also investigating automatically starting the service as part of the deb/rpm installer.

@leandrojmp
Contributor

Hello @cmacknz

> I should note that the error here doesn't mean the enrollment failed, enrollment actually succeeded and if you ignore the error and continue with the following the agent successfully connects to Fleet.

While this is true, this has some impact when using automation tools.

For example, Ansible relies on the exit code of the previous command to decide whether to continue to the next task in the playbook or exit with an error. Currently, the enroll command as described in the Fleet UI instructions always fails, returning an exit code of 1, which then halts the Ansible playbook.

I was helping one of the infra teams in my company write an Ansible playbook to deploy the agents, and spent some time troubleshooting why it was not working and always failing at the enrollment step.

I was only able to fix the playbook because I found this issue and the undocumented flag --skip-daemon-reload. I think this flag should be described in the documentation.

After that, I tested on another server, and using --delay-enroll also works.

Since the next steps consist of enabling and starting the systemd service, we chose to use --delay-enroll, as it is a little faster in the Ansible playbook.
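
To illustrate the halting behaviour outside of Ansible, here is a small shell sketch that runs steps in order and stops at the first non-zero exit code. It is not how Ansible is implemented; run_steps is a hypothetical helper that only mimics the stop-on-failure rule.

```shell
#!/bin/sh
# Hypothetical helper: run each argument as a shell command, in order.
# Stop at the first command that exits non-zero and return its exit code,
# so later steps never run (similar to a playbook halting on a failed task).
run_steps() {
    for step in "$@"; do
        sh -c "$step" || {
            status=$?
            echo "step failed with exit code $status: $step" >&2
            return "$status"
        }
    done
}

# Intended use, not executed here: if the enroll step exits with 1, the two
# systemctl steps are skipped even though enrollment itself succeeded.
#   run_steps \
#       "elastic-agent enroll --url=https://XXXXX:443 --enrollment-token=XXXXX" \
#       "systemctl enable elastic-agent" \
#       "systemctl start elastic-agent"
```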
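
For anyone automating this, the sequence we ended up with can be sketched roughly as the shell function below. The names run, deploy_agent, and DRY_RUN are hypothetical helpers for illustration; the only agent-specific pieces are the enroll command and the --delay-enroll flag mentioned above, and the URL and token are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: enroll with --delay-enroll, then enable and start the
# service. Each step runs only if the previous one exited with 0.
# $1 is the Fleet URL, $2 is the enrollment token. Must run as root.
deploy_agent() {
    run elastic-agent enroll --url="$1" --enrollment-token="$2" --delay-enroll &&
        run systemctl enable elastic-agent &&
        run systemctl start elastic-agent
}
```

Setting DRY_RUN=1 before calling deploy_agent prints the three commands without touching the system, which is handy for checking the order before wiring it into a playbook.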

7 participants