Add option to enable persistence to the status, to allow restart-ability to the script #4

Open
wants to merge 7 commits into
base: master

realmgic
Member

In some cases the script exits (or hangs) for various (unknown) reasons, leaving the cluster in quiesce mode. Saving the state to a file allows the script to be restarted; if the node is out of maintenance mode, the script will then automatically undo the quiesce.

@realmgic
Member Author

@spkesan - do you have some time to review and merge this? thanks! :)

@arrowplum requested a review from spkesan on November 12, 2019 18:38
@spkesan

spkesan commented Nov 18, 2019

@realmgic
Thanks for the work and PR. I'll review the changes.
Do we know why the script exits (or hangs)?

@spkesan

spkesan commented Nov 18, 2019

Hi @realmgic
If we just want to restart the script to perform quiesce-undo (when the node was quiesced by the script and the script did not run to completion for some reason; I guess that's what you are trying to address here?), wouldn't it be simpler to pass in last_maintenance_event via a command line option? Then we wouldn't need this many changes or need to persist the last event, right?

Let's say:

  1. maintenance-event changes to MIGRATE_ON_HOST_MAINTENANCE.
  2. The script notices this and quiesces the node.
  3. After the live migration, maintenance-event changes back to NONE, but let's say that before this point the script exits (for some unknown reason, as you mentioned).
  4. Now the node is in the quiesced state. We need to quiesce-undo the node.
  5. We can restart the script, passing last_maintenance_event as MIGRATE_ON_HOST_MAINTENANCE. The script will perform the quiesce-undo since the latest maintenance-event will be NONE.
  • Ideally we should find out and fix why the script is hanging or exiting unexpectedly.
  • We should also improve logging so we know the last state of the script (since it's only 60 seconds from the metadata change to the actual start of the maintenance event).
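
The steps above could be sketched roughly as follows. This is only an illustration of the idea, not code from the script: the `--last-maintenance-event` option name, the `next_action` helper, and the action strings are all hypothetical.

```python
import argparse

MAINTENANCE = "MIGRATE_ON_HOST_MAINTENANCE"
NONE = "NONE"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="maintenance watcher (sketch)")
    # Hypothetical option: lets an operator tell a restarted script
    # which event it had last acted on before it exited.
    parser.add_argument(
        "--last-maintenance-event",
        default=NONE,
        choices=[NONE, MAINTENANCE],
    )
    return parser.parse_args(argv)


def next_action(last_event, current_event):
    """Decide what to do given the previous and the latest maintenance-event."""
    if current_event == MAINTENANCE and last_event != MAINTENANCE:
        return "quiesce"
    if current_event == NONE and last_event == MAINTENANCE:
        return "quiesce-undo"
    return "wait"


# Step 5: restart with the last event supplied by hand; metadata now says NONE.
args = parse_args(["--last-maintenance-event", MAINTENANCE])
print(next_action(args.last_maintenance_event, NONE))  # prints "quiesce-undo"
```

Note that this approach relies on whoever restarts the script knowing what the last event was.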

When you observed the script hung or exited unexpectedly, did you check or collect the log file (/var/log/aerospike/agm.log)?

@realmgic
Member Author

realmgic commented Nov 18, 2019

Hi @spkesan,

When the script (or actually, the systemd service which runs it) is restarted, we don't know what the last state we observed was or what the cluster state is.

In that specific case, the service was stopped after the maintenance flag was raised but before we got the NONE event to clear it. When the service started again, we got NONE, so we assumed the cluster was "fine" and the maintenance (quiesce) was never cleared. From the cluster's perspective, that node stayed quiesced for another full day or two before someone noticed it on a dashboard after the weekend.

What I did is persist the fact that the last state we saw was maintenance (saved to a file), so that when we come back we can check for it and either undo any quiesce that might still be in place (if we get NONE) or know that we're still in maintenance mode and wait for it to end (if we get MIGRATE_ON_HOST_MAINTENANCE).
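
As a rough sketch of that idea (not the code in this PR; the file name, JSON layout, and function names here are made up for illustration), the persisted state and the startup decision could look something like this:

```python
import json
import os
import tempfile

MAINTENANCE = "MIGRATE_ON_HOST_MAINTENANCE"
NONE = "NONE"


def save_state(path, event):
    """Record the last event we acted on, replacing the file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"last_event": event}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash never leaves a half-written file


def load_state(path):
    """Return the saved event, or NONE if the file is missing or unreadable."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (FileNotFoundError, ValueError):
        return NONE
    return data.get("last_event", NONE) if isinstance(data, dict) else NONE


def action_on_startup(saved_event, current_event):
    """Decide what a freshly (re)started service should do."""
    if saved_event == MAINTENANCE and current_event == NONE:
        return "quiesce-undo"  # maintenance ended while we were down
    if saved_event == MAINTENANCE and current_event == MAINTENANCE:
        return "wait"  # still in maintenance; wait for NONE
    return "nothing"


with tempfile.TemporaryDirectory() as d:
    state_file = os.path.join(d, "agm.state")  # hypothetical file name
    save_state(state_file, MAINTENANCE)  # written right after quiescing
    # ... service is stopped here and started again later ...
    print(action_on_startup(load_state(state_file), NONE))  # prints "quiesce-undo"
```

Writing to a temporary file and renaming it means a crash in the middle of a save leaves either the old state or the new one, never a truncated file.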

2 participants