[DPE-4575] Add voting settle logic at start and stop service #345
Conversation
Hey @phvalguima, I've tested this locally and ran into issues when removing the application with only two units left. It fails with:
```
unit-opensearch-3: 09:41:16 ERROR unit.opensearch/3.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-3/charm/./src/charm.py", line 267, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/framework.py", line 350, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/framework.py", line 849, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/venv/ops/framework.py", line 939, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-3/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 523, in _on_opensearch_data_storage_detaching
    raise Exception("Unable to acquire lock: Another unit is starting or stopping.")
Exception: Unable to acquire lock: Another unit is starting or stopping.
```
and the charm can't get out of it anymore. Furthermore, I've seen some failures in the integration tests, so it looks like it does not really work yet.
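For context, here is a hedged reconstruction of the pattern behind that traceback, inferred only from the stack trace. The storage name, the `_acquire_lock` helper, and the event wiring are all assumptions; the real charm's lock coordination is more involved:

```python
# Hypothetical reconstruction based on the traceback above; not the actual charm code.
import ops
from ops.main import main


class OpenSearchOperatorCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # "opensearch-data" as the storage name is an assumption here.
        self.framework.observe(
            self.on["opensearch-data"].storage_detaching,
            self._on_opensearch_data_storage_detaching,
        )

    def _acquire_lock(self) -> bool:
        # Placeholder: the real lock is coordinated across units (e.g. via
        # the peer relation or an OpenSearch index) and can fail under
        # contention; returning False models the contended case.
        return False

    def _on_opensearch_data_storage_detaching(self, event: ops.StorageDetachingEvent):
        # Raising here (rather than deferring) makes the hook fail outright;
        # when two units are removed at once, each can hold up the other's
        # lock, matching the stuck state described above.
        if not self._acquire_lock():
            raise Exception(
                "Unable to acquire lock: Another unit is starting or stopping."
            )


if __name__ == "__main__":
    main(OpenSearchOperatorCharm)
```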
Looks good!
Currently, we have several issues with 2-node clusters, as OpenSearch won't automatically manage voting anymore. This PR adds logic to manually manage the 2-node cluster scenario using the `voting_exclusions` API. Whenever only two nodes are active and registered to the cluster as voting units, `_settle_voting_exclusions` will exclude the non-leader unit from voting. It also excludes the leaving unit, to cover the scenario where that unit is the cluster_manager: moving from 3->2 units, we may otherwise end up with stale metadata.

This PR also makes applying the exclusions mandatory at start/stop, so we are sure the voting count is always correct at each stage.
For example, moving from 3->2->1 units results in:

- 3->2: the cluster sets 2x voting exclusions: one for the unit leaving (if it is the cluster manager, that role will move away) and one for one of the 2x remaining units, following a sorted list of node names
- 2->1: the voting exclusions are removed

Likewise, on scaling up:

- 1->2: one voting exclusion is added, following the same sorted list of all node names
- 2->3: voting exclusions are removed
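The selection rule in the list above can be expressed as a small pure function. This is a hedged sketch of the described behavior (exclude the leaving unit, and on a 2-voter cluster exclude one deterministic non-leader from the sorted node names); it is not the charm's actual `_settle_voting_exclusions`:

```python
def pick_voting_exclusions(
    node_names: list[str], leader: str, leaving_unit: str | None = None
) -> list[str]:
    """Return the node names to exclude from voting, per the rules above."""
    exclusions = []
    if leaving_unit:
        # Exclude the departing unit so that, if it is the elected
        # cluster_manager, the role moves away before the node leaves
        # (avoids stale metadata on a 3->2 transition).
        exclusions.append(leaving_unit)
    remaining = sorted(n for n in node_names if n != leaving_unit)
    if len(remaining) == 2:
        # Two-node cluster: exclude the first non-leader from the sorted
        # list so exactly one voter (the leader) remains.
        exclusions += [n for n in remaining if n != leader][:1]
    return exclusions


# 3->2: the leaving unit plus one of the two remaining units get excluded.
assert pick_voting_exclusions(
    ["opensearch-0", "opensearch-1", "opensearch-2"],
    leader="opensearch-0",
    leaving_unit="opensearch-2",
) == ["opensearch-2", "opensearch-1"]

# At 1 unit (after 2->1), nothing is excluded; existing exclusions are cleared.
assert pick_voting_exclusions(["opensearch-0"], leader="opensearch-0") == []
```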
This PR touches #324, #326, and #327, and this behavior is also observed in #325. It is also linked to issues in our `ha/test_storage.py`, as one can see in this run.

Closes: #324, #326, #327