-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a service to put the instance out of traffic smoothly before shutdown #8
base: master
Are you sure you want to change the base?
Conversation
Build failed. ❌ wazo-tox-molecule-bullseye FAILURE in 3m 54s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good idea :) Thanks!
Type=oneshot | ||
RemainAfterExit=yes | ||
ExecStop=/usr/local/bin/consul maint -enable | ||
ExecStopPost=/usr/bin/sleep 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duration should be templated too. I would recommend a higher timeout, e.g. 45 seconds or 60 seconds, to avoid race conditions between DNS cache expiration and DNS client resolution that may happen simultaneously, or with a few seconds of delay.
Also, a higher value may require adding a TimeoutStopSec=
(default 90 seconds).
This duration should only be that high for publicly exposed instances, such as HAProxies or Coturn services (maybe others?). Other internal services are resolved on-the-spot via the Consul resolver, and could do with a lower timeout, e.g. 5 or 10 seconds.
Do you know what happens if we systemctl restart haproxy
for example? Do it wait for 30 seconds too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the TimeoutStopSec
option, do you think we will happen to need more than 90 seconds ? Otherwise it stops the service (with an error for sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If systemctl restart haproxy
, nothing happens, it still waits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the TimeoutStopSec
option, it's only needed if we template the sleep duration. If it is templated, either document to not exceed 90seconds, or template the TimeoutStopSec
as well.
About the restart, does it mean that running systemctl restart haproxy
will enable Consul maintenance, wait 30 seconds, restart HAProxy, and leave Consul maintenance enabled? So restarting HAProxy would remove the instance from traffic?
tasks/main.yml
Outdated
- name: Upload a service to put consul under maintenance at shutdown | ||
template: | ||
src: consul-maint-before-shutdown.service.j2 | ||
dest: /lib/systemd/system/consul-maint-before-shutdown.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this should rather go in /etc/systemd/system/...
. The /lib/systemd
path is used for OS-delivered unit files.
tasks/main.yml
Outdated
dest: /lib/systemd/system/consul-maint-before-shutdown.service | ||
owner: root | ||
group: root | ||
mode: "0644" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mode: "0644" | |
mode: "0660" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this change, if I create a systemd service I end up with "0644"
root@i-0c539d980708b2e2a:~# systemctl edit --force --full ahoguin.service
root@i-0c539d980708b2e2a:~# ls -la /etc/systemd/system/
...
-rw-r--r-- 1 root root 194 Apr 10 15:58 ahoguin.service
[Unit] | ||
Description=Consul maintenance task at shutdown | ||
After=consul.service | ||
After={{ consul_service_name }}.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case the value for consul_service_name
is wrong, I would prefer this not to fail silently. Will it fail at the state=started
task?
If not, could we have some alert? Like an assert
that the service file exists, which will fail the whole recipe and send a Mattermost alert?
Also, this implies that the Consul service is installed after the service it requires, not sure it's the case in every role...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what happens if consul_service_name
is incorrect?
…shutdown **why** The DNS TTL are about 30 seconds in our infra. When we let a time for the instance to cool off before shutting it down, it gives time to DNS re-resolving and avoid targetting a dead instance
bd2d192
to
44c1466
Compare
Build failed. ❌ wazo-tox-molecule-bullseye FAILURE in 3m 08s |
Build failed. ❌ wazo-tox-molecule-bullseye FAILURE in 3m 08s |
**why** We will activate it easily on some services (Haproxy, nginx) and will avoid put it on services where consul-service service is ran before the actual expected service
**contains** - adds molecule suite tests for the feature - removes vagrant related tests - sets `requests` version
e8c1f9e
to
eae825e
Compare
Build succeeded. ✔️ wazo-tox-molecule-bullseye SUCCESS in 8m 06s |
why
The DNS TTLs are about 30 seconds in our infra. When we let a time for the instance to cool off before shutting it down, it gives time to DNS re-resolving and avoid targeting a dead instance
Example
Templated file on Haproxy instance
After I type
shutdown now
there is a 30s timeframe where consul is under maintenance, but haproxy is still availableNote
I am not 100% sure there is no side effects. This works on Haproxy instance well and the wost case I can see is
{{ consul_service_name }}
is not templated correctly but in this case the service just won't start and we will see current behaviour.