Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a service to put the instance out of traffic smoothly before shutdown #8

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ahogui12
Copy link
Member

@ahogui12 ahogui12 commented Apr 5, 2024

why
The DNS TTLs are about 30 seconds in our infra. When we let a time for the instance to cool off before shutting it down, it gives time to DNS re-resolving and avoid targeting a dead instance

Example
Templated file on Haproxy instance

[Unit]
Description=Consul maintenance task at shutdown
After=consul.service
After=haproxy.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/local/bin/consul maint -enable -reason="Shutting down instance"
ExecStopPost=/usr/bin/sleep 30

[Install]
WantedBy=multi-user.target

After I type shutdown now there is a 30s timeframe where consul is under maintenance, but haproxy is still available

Note
I am not 100% sure there is no side effects. This works on Haproxy instance well and the wost case I can see is {{ consul_service_name }} is not templated correctly but in this case the service just won't start and we will see current behaviour.

@ahogui12 ahogui12 requested a review from sduthil April 5, 2024 18:18
Copy link
Contributor

Copy link
Member

@sduthil sduthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good idea :) Thanks!

Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/local/bin/consul maint -enable
ExecStopPost=/usr/bin/sleep 30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duration should be templated too. I would recommend a higher timeout, e.g. 45 seconds or 60 seconds, to avoid race conditions between DNS cache expiration and DNS client resolution that may happen simultaneously, or with a few seconds of delay.

Also, a higher value may require adding a TimeoutStopSec= (default 90 seconds).

This duration should only be that high for publicly exposed instances, such as HAProxies or Coturn services (maybe others?). Other internal services are resolved on-the-spot via the Consul resolver, and could do with a lower timeout, e.g. 5 or 10 seconds.

Do you know what happens if we systemctl restart haproxy for example? Do it wait for 30 seconds too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the TimeoutStopSec option, do you think we will happen to need more than 90 seconds ? Otherwise it stops the service (with an error for sure)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If systemctl restart haproxy , nothing happens, it still waits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the TimeoutStopSec option, it's only needed if we template the sleep duration. If it is templated, either document to not exceed 90seconds, or template the TimeoutStopSec as well.

About the restart, does it mean that running systemctl restart haproxy will enable Consul maintenance, wait 30 seconds, restart HAProxy, and leave Consul maintenance enabled? So restarting HAProxy would remove the instance from traffic?

tasks/main.yml Outdated
- name: Upload a service to put consul under maintenance at shutdown
template:
src: consul-maint-before-shutdown.service.j2
dest: /lib/systemd/system/consul-maint-before-shutdown.service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this should rather go in /etc/systemd/system/.... The /lib/systemd path is used for OS-delivered unit files.

tasks/main.yml Outdated
dest: /lib/systemd/system/consul-maint-before-shutdown.service
owner: root
group: root
mode: "0644"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mode: "0644"
mode: "0660"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this change, if I create a systemd service I end up with "0644"

root@i-0c539d980708b2e2a:~# systemctl edit --force --full ahoguin.service
root@i-0c539d980708b2e2a:~# ls -la /etc/systemd/system/
...
-rw-r--r--  1 root root  194 Apr 10 15:58 ahoguin.service

[Unit]
Description=Consul maintenance task at shutdown
After=consul.service
After={{ consul_service_name }}.service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case the value for consul_service_name is wrong, I would prefer this not to fail silently. Will it fail at the state=started task?

If not, could we have some alert? Like an assert that the service file exists, which will fail the whole recipe and send a Mattermost alert?

Also, this implies that the Consul service is installed after the service it requires, not sure it's the case in every role...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what happens if consul_service_name is incorrect?

…shutdown

**why**
The DNS TTL are about 30 seconds in our infra. When we let a time for the instance to cool off before shutting it down, it gives time to DNS re-resolving and avoid targetting a dead instance
@ahogui12 ahogui12 force-pushed the consul-maint-before-shutdown branch from bd2d192 to 44c1466 Compare April 10, 2024 00:00
Copy link
Contributor

Copy link
Contributor

@ahogui12 ahogui12 requested a review from jfbourque April 11, 2024 13:32
**why**
We will activate it easily on some services (Haproxy, nginx) and will avoid put it on services where consul-service service is ran before the actual expected service
**contains**
- adds molecule suite tests for the feature
- removes vagrant related tests
- sets `requests` version
@ahogui12 ahogui12 force-pushed the consul-maint-before-shutdown branch from e8c1f9e to eae825e Compare April 16, 2024 17:38
@ahogui12 ahogui12 requested a review from sduthil April 16, 2024 17:39
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants