[question] how to manage frequent restart in hirte and hirte-agent? #427

Open
dougsland opened this issue Aug 2, 2023 · 2 comments
Labels
backlog This is next up in priority bug Something isn't working good first issue Good for newcomers

Comments

@dougsland
Contributor

Describe the bug

Suppose the hirte daemon or the agent crashes, as in #425, for example.

  • Should the hirte service (the same applies to the agent) stay down after "hirte.service: Start request repeated too quickly."?
  • Or should it keep retrying until it recovers (e.g. once a new config has been sent over the network)? If so, how long should it wait between restart attempts, and what is the minimum wait before restarting the node or redeploying? (Agents depend on the manager node to report.)

There are systemd unit settings that might help control this behavior: StartLimitInterval (named StartLimitIntervalSec in current systemd) and StartLimitBurst.
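As a rough sketch of how those two settings could be applied, a drop-in like the following would widen the rate-limit window. The drop-in path and the numbers are assumptions for illustration, not values taken from the hirte packaging. The log further down shows the limit being hit at restart counter 5, which matches systemd's usual defaults of 5 starts within 10 seconds.

```ini
# Hypothetical drop-in: /etc/systemd/system/hirte.service.d/10-start-limit.conf
# The same idea would apply to hirte-agent.service.
[Unit]
# Window over which start attempts are counted (illustrative value).
StartLimitIntervalSec=60
# Start attempts allowed inside that window before systemd refuses
# further starts with "Start request repeated too quickly."
StartLimitBurst=5
```

After adding a drop-in, run systemctl daemon-reload. A unit that has already hit the limit can be cleared with systemctl reset-failed hirte.service before starting it again.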

Output of systemctl status hirte:

× hirte.service - Hirte systemd service controller manager daemon
     Loaded: loaded (/usr/local/lib/systemd/system/hirte.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Wed 2023-08-02 05:45:53 UTC; 1s ago
   Duration: 3ms
       Docs: man:hirte(1)
             man:hirte.conf(5)
    Process: 214542 ExecStart=/usr/bin/hirte -c /etc/hirte/hirte.conf (code=exited, status=1/FAILURE)
   Main PID: 214542 (code=exited, status=1/FAILURE)
        CPU: 3ms

Aug 02 05:45:53 control systemd[1]: hirte.service: Scheduled restart job, restart counter is at 5.
Aug 02 05:45:53 control systemd[1]: Stopped Hirte systemd service controller manager daemon.
Aug 02 05:45:53 control systemd[1]: hirte.service: Start request repeated too quickly.
Aug 02 05:45:53 control systemd[1]: hirte.service: Failed with result 'exit-code'.
Aug 02 05:45:53 control systemd[1]: Failed to start Hirte systemd service controller manager daemon.
@dougsland dougsland added the bug Something isn't working label Aug 2, 2023
@rhatdan
Contributor

rhatdan commented Aug 2, 2023

I would think only a couple of restarts. My understanding of FuSa (functional safety) is that once it fails, the car needs to go into safety mode.
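One way this "a few restarts, then safe state" policy might be expressed in systemd is a small start limit combined with OnFailure=. This is only a sketch: safe-mode.target is a hypothetical unit name, the numbers are illustrative, and it assumes that a service with Restart= set only enters the failed state (and so only triggers OnFailure=) once the start limit has been reached, which should be verified against the systemd version in use.

```ini
# Hypothetical drop-in: /etc/systemd/system/hirte.service.d/20-safe-mode.conf
[Unit]
# Give up after 3 start attempts within 30 seconds (illustrative values).
StartLimitIntervalSec=30
StartLimitBurst=3
# Unit to activate once hirte.service ends up in the failed state.
# "safe-mode.target" is a made-up name standing in for whatever unit
# puts the system into its safe state.
OnFailure=safe-mode.target
```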

@engelmi
Member

engelmi commented Aug 3, 2023

I thought we had already added those, but it seems that in #231 we only considered them.
Maybe we can also set RestartSec, since the default of 100ms seems pretty fast. Using RestartSteps and RestartMaxDelaySec for a kind of exponential backoff could also be interesting.
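A minimal sketch of what such a backoff configuration could look like is below. The values and the drop-in path are assumptions for illustration. RestartSteps and RestartMaxDelaySec need a recent systemd (they were added in v254) and only take effect when both are set.

```ini
# Hypothetical drop-in: /etc/systemd/system/hirte.service.d/30-backoff.conf
[Service]
# Assumed restart policy; the shipped unit may already set this.
Restart=on-failure
# Delay before the first automatic restart (systemd's default is 100ms).
RestartSec=1s
# Number of steps over which the delay grows from RestartSec
# up to RestartMaxDelaySec.
RestartSteps=5
# Upper bound for the delay between automatic restarts.
RestartMaxDelaySec=30s
```

Note that the backoff interacts with StartLimitIntervalSec and StartLimitBurst. If the delays grow long enough that fewer than StartLimitBurst starts fit into one interval, the start limit is never reached and the service keeps retrying indefinitely, which may or may not be the desired behavior.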

@mkemel mkemel added this to the v0.7 milestone Nov 21, 2023
@mkemel mkemel added jira Issues that are synced to Jira backlog This is next up in priority and removed jira Issues that are synced to Jira labels Nov 21, 2023
@mkemel mkemel removed this from the v0.7 milestone Nov 21, 2023
@mkemel mkemel added the good first issue Good for newcomers label Nov 21, 2023

4 participants