Add option to enable persistence to the status, to allow restart-ability to the script #4

realmgic · 2019-10-10T13:58:23Z

In some cases the script exits (or hangs) for various (unknown) reasons, leaving the cluster in quiesce mode. Saving the state to a file allows the script to be restarted; if the node is out of maintenance mode by then, the script will automatically undo the quiesce.
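A minimal sketch of the persistence idea, not the code in this PR: the last observed event is written to a small file and read back on start. The function names and the plain-text file format are assumptions made for illustration; the "NONE" default mirrors the initial-state fix mentioned in the commits.

```python
import os
import tempfile

DEFAULT_STATE = "NONE"  # assumed default when nothing has been persisted yet


def save_state(path, state):
    """Write the state atomically so a crash mid-write cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(state)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path):
    """Return the persisted state, or the default if the file is missing or empty."""
    try:
        with open(path) as handle:
            return handle.read().strip() or DEFAULT_STATE
    except FileNotFoundError:
        return DEFAULT_STATE
```

Writing to a temporary file and renaming it means a restart never reads a half-written state, which matters here because the script may be killed at any point.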


realmgic · 2019-10-24T12:02:13Z

@spkesan - do you have some time to review and merge this? thanks! :)

spkesan · 2019-11-18T06:25:47Z

@realmgic
Thanks for the work and PR. I'll review the changes.
Do we know why "the script exits (or hangs) on various (unknown) reasons"?

spkesan · 2019-11-18T09:11:51Z

Hi @realmgic
If we just want to restart the script so that it performs the quiesce-undo (when the node was quiesced by the script and the script did not run to completion for some reason; I guess that's what you are trying to address here?), wouldn't it be simpler to pass in last_maintenance_event via a command-line option? Then we wouldn't need this many changes, or to persist the last event, right?

Let's say:

The maintenance-event changes to MIGRATE_ON_HOST_MAINTENANCE.
The script notices this and quiesces the node.
After the live migration, the maintenance-event will change back to NONE, but let's say that before this point the script exits (for some unknown reason, as you mentioned).
Now the node is in a quiesced state. We need to quiesce-undo the node.
We can restart the script, passing last_maintenance_event as MIGRATE_ON_HOST_MAINTENANCE. The script will perform the quiesce-undo, since the latest maintenance-event will be None.
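The command-line alternative described above could look roughly like this. It is only a sketch: the flag name `--last-maintenance-event` and the `parse_args` helper are hypothetical, and the accepted values are limited to the two events named in this thread.

```python
import argparse

# Only the two maintenance-event values discussed in this thread.
KNOWN_EVENTS = ("NONE", "MIGRATE_ON_HOST_MAINTENANCE")


def parse_args(argv=None):
    """Parse a hypothetical option that seeds the last seen maintenance event."""
    parser = argparse.ArgumentParser(
        description="Sketch: seed the last maintenance event on restart."
    )
    parser.add_argument(
        "--last-maintenance-event",
        choices=KNOWN_EVENTS,
        default="NONE",
        help="event the script last observed before it stopped",
    )
    return parser.parse_args(argv)
```

An operator would then restart with `--last-maintenance-event MIGRATE_ON_HOST_MAINTENANCE`, which requires them to know what the script had seen before it stopped.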

Ideally, we should find out and fix why the script is hanging or exiting unexpectedly.
We should also improve logging so that we know the last state of the script (since there are only 60 seconds between the metadata change and the actual start of the maintenance event).
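As a rough illustration of the logging suggestion, each observed change could be logged so the last state is recoverable from the log file. The `log_transition` helper and its message format are assumptions, not part of the existing script.

```python
import logging


def log_transition(logger, previous_event, current_event):
    """Log a maintenance-event change; return True if the event changed."""
    if previous_event == current_event:
        logger.debug("maintenance-event unchanged: %s", current_event)
        return False
    logger.info("maintenance-event changed: %s -> %s", previous_event, current_event)
    return True
```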

When you observed the script hung or exited unexpectedly, did you check or collect the log file (/var/log/aerospike/agm.log)?

realmgic · 2019-11-18T16:51:58Z

Hi @spkesan,

When the script (or actually, the systemd service which runs it) is restarted, we don't know what the last state we observed was, or what state the cluster is in.

In that specific case, the service was stopped after the maintenance flag was raised but before we got the NONE event to clear it. When the service started again, we got NONE, so we assumed the cluster was "fine", and the maintenance (quiesce) was never cleared. From the cluster's perspective, that node stayed quiesced for another full day or two before someone noticed it on a dashboard somewhere after the weekend.

What I did was persist the fact that the last state we saw was maintenance (saved to a file), so that when we come back we can check for it and either undo any quiesce that might still be in place (if we get NONE) or know that we're still in maintenance mode and wait for it to end (if we get MIGRATE_ON_HOST_MAINTENANCE).
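The restart behaviour described here can be sketched as a small decision function. The names `Action` and `action_on_start` are hypothetical, and only the two event values mentioned in this thread are handled.

```python
from enum import Enum

MAINTENANCE = "MIGRATE_ON_HOST_MAINTENANCE"
NO_EVENT = "NONE"


class Action(Enum):
    NOTHING = "nothing to do"
    QUIESCE = "quiesce the node"
    WAIT = "wait for maintenance to end"
    UNDO_QUIESCE = "undo the quiesce"


def action_on_start(persisted_event, current_event):
    """Pick the first action after a restart from the persisted and current events."""
    if persisted_event == MAINTENANCE and current_event == NO_EVENT:
        # Maintenance ended while the script was down: clear the stale quiesce.
        return Action.UNDO_QUIESCE
    if persisted_event == MAINTENANCE and current_event == MAINTENANCE:
        # Still in maintenance: the node is already quiesced, so just wait.
        return Action.WAIT
    if current_event == MAINTENANCE:
        # No maintenance on record, but one is starting now.
        return Action.QUIESCE
    return Action.NOTHING
```

Without the persisted value, the first case is indistinguishable from a clean start, which is how the node in the incident above stayed quiesced.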

realmgic added 7 commits October 10, 2019 12:56

add an option to configure timeout for the metadata service call and handle "legit" (200) exits without a status change.

43885c9

Add the ability to persist the last status to file so the script will be restartable and will pick up where it left off

0aa210e

change the persistance default to false

ea86bef

add option to specify persistance at the command line

41dc902

add documentation to the new command line options

0c6eedc

1. add parameter to configure shorter timeout for google metadata service 2. fix initial state of None to "NONE" for persistance

1b3a6fd

fix documentation for timeout parameter

84872a3

arrowplum requested a review from spkesan November 12, 2019 18:38
