[DPE-4575][DPE-4886][DPE-4983] Add voting exclusions management #367
Conversation
Hey Pedro, the PR looks quite good. I've tested esp. the scenario with re-attaching storage locally multiple times and could not reproduce the issue anymore. That's good!
I just left some questions and comments for details, but in general I'd say this is fine.
@@ -537,8 +543,6 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
While before this cleanup was done on every update_status, now it's only done when the health is green. Is this on purpose?
That is not the case... It is done in every case except HealthColors.UNKNOWN. Indeed, we defer the event if it is not green. I put it down there because I need the API to be responsive before configuring voting exclusions; if it is not responsive, we will get UNKNOWN anyway and retry later.
I will add some comments to clarify that.
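For context, a minimal sketch of the ordering being described (the helper call `self.health.apply()` and the attribute names are assumptions, not necessarily the charm's exact code): the health check runs first, and the exclusions cleanup only happens once the API has answered with something other than UNKNOWN.

```python
# Minimal sketch of the ordering discussed above -- not the charm's exact code.
# HealthColors comes from the charm's opensearch lib; self.health.apply() is an
# assumed helper name.
def _on_update_status(self, event):
    health = self.health.apply()

    if health == HealthColors.UNKNOWN:
        # API is not responsive: do not touch exclusions, update-status will
        # fire again later anyway.
        return

    # The API answered, so exclusions are handled for every color except UNKNOWN.
    if self.unit.is_leader():
        self.opensearch_exclusions.cleanup()

    if health != HealthColors.GREEN:
        # Cluster not settled yet: defer and re-check on the next run.
        event.defer()
        return
```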
        self.opensearch_exclusions.add_voting(hosts, node_names=[sorted_cm[0]])
        # Now, we clean up the sorted_cm list, as we want to be sure the new manager is elected
        # and different than the excluded units.
        sorted_cm.pop(0)
I was wondering why you chose to exclude node [0] from voting and remove it from the list? This will result in a new cluster manager node being elected when you scale up from 1 unit to more, and in the process of removing the application. I just saw this locally when testing; not that it creates that much latency, but wouldn't it be better to use the last one from the list instead?
So, @reneradoi, yes, I noticed that as well. But when I discussed this with @Mehdi-Bendriss, we agreed to make this list predictable instead of keeping track of the cluster manager. I could add a check here for that, but then the other check, right before, gets slightly more complicated:
if unit_is_stopping:
    # Remove both this unit and the first sorted_cm from the voting
    self.opensearch_exclusions.add_voting(
        hosts, node_names=[self.unit_name, sorted_cm[0]]  ## <<<------ should we also add a check here?
    )
Given this is quite an exception (i.e. going from 1->2 or 2->1), I took the simpler approach.
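For reference, the extra guard being discussed would look roughly like this (a sketch only, not the code in this PR):

```python
# Sketch of the extra check discussed above: skip the stopping unit itself when
# picking the second exclusion from the sorted list. Not the code in this PR.
if unit_is_stopping:
    candidates = [name for name in sorted_cm if name != self.unit_name]
    node_names = [self.unit_name]
    if candidates:
        node_names.append(candidates[0])
    self.opensearch_exclusions.add_voting(hosts, node_names=node_names)
```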
I can improve the comments around here, this logic is pretty brittle tbh.
I think I overlooked and missed a safety component there: if we remain with 2 units and the one being removed is not the current elected CM, maybe it makes sense to add the other unit to the voting exclusions.
This should reduce switchovers and the risk they entail.
@Mehdi-Bendriss I would not recommend that. The main reason is that I noticed moving the elected manager between nodes is far faster than Juju hooks. We need to be predictable in this specific case, even if it means moving the elected manager.
Thanks Pedro - I have a few questions. There is also the aspect of the missing exception handling of settle_voting in various places in the code.
@@ -573,6 +585,9 @@ def _on_config_changed(self, event: ConfigChangedEvent):  # noqa C901
        restart_requested = False
        if self.opensearch_config.update_host_if_needed():
            restart_requested = True
            # Review voting exclusions as our IP has changed: we may be coming back from a network
            # outage case.
            self._settle_voting_exclusions(unit_is_stopping=False)
What will happen here if no other node is online (self.alt_hosts == [])? I believe this will retry for 5 minutes and eventually crash with a RetryError.
@Mehdi-Bendriss so, self.alt_hosts will not return the local host in the list? How can I get it then?
Changed the way we call both ClusterTopology's elected_manager and nodes. They will now do more checks on self.alt_hosts.
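To illustrate the kind of extra check on `self.alt_hosts` mentioned here, a sketch (attribute names such as `self.unit_ip` are assumptions, not the charm's exact API):

```python
# Sketch of an alt_hosts guard as described above; attribute and method names
# are assumptions, not the exact charm code.
def _hosts_for_topology_queries(self) -> list:
    hosts = list(self.alt_hosts or [])
    if not hosts and self.unit_ip:
        # No peer is reachable (single unit or network outage): fall back to
        # the local node instead of retrying for 5 minutes and raising
        # RetryError.
        hosts = [self.unit_ip]
    return hosts
```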
@@ -537,17 +548,26 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
Is there a reason why the shards allocation exclusion cleanup is postponed until later in the hook? As long as there is connectivity to a host, we should be able to clean up.
Yes, the health checks below let anything pass unless the cluster is really in a bad state (i.e. UNKNOWN). So I moved the cleanup below these first health checks, because it makes more sense there.
                resp_status_code=True,
                retries=3,
            )
            return True
        except OpenSearchHttpError:
            return False

-   def _delete_voting(self) -> bool:
+   def delete_voting(self, alt_hosts: Optional[List[str]] = None) -> bool:
        """Remove all the voting exclusions - cannot target 1 exclusion at a time."""
        # "wait_for_removal" is VERY important, it removes all voting configs immediately
        # and allows any node to return to the voting config in the future
        try:
            self._opensearch.request(
                "DELETE",
                "/_cluster/voting_config_exclusions?wait_for_removal=false",
I do not remember why I set wait_for_removal=false instead of true in the past. I now don't see a reason why it shouldn't be true.
@Mehdi-Bendriss in my own tests, I could not get this DELETE to work with wait_for_removal=true... I wanted to move it to true as well, btw!
@Mehdi-Bendriss, found this in the docs:
Defaults to true, meaning that all excluded nodes must be removed from the cluster before this API takes any action.
So, wait_for_removal=true expects the node to be gone entirely... That is why it is not suitable for us.
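For reference, the two cluster-level calls under discussion, shown here with plain `requests` instead of the charm's `self._opensearch.request` wrapper (host, credentials and node names are placeholders):

```python
# Illustration of the voting-exclusions API calls discussed above, using
# `requests` directly; host, credentials and node names are placeholders.
import requests

BASE = "https://localhost:9200"
AUTH = ("admin", "admin")  # placeholder credentials

# Add a node to the voting exclusions list.
requests.post(
    f"{BASE}/_cluster/voting_config_exclusions",
    params={"node_names": "opensearch-1", "timeout": "1m"},
    auth=AUTH,
    verify=False,
)

# Clear all exclusions. With wait_for_removal=false the entries are dropped
# immediately; with the default (true) the API waits until the excluded nodes
# have actually left the cluster, which is why it does not fit this use case.
requests.delete(
    f"{BASE}/_cluster/voting_config_exclusions",
    params={"wait_for_removal": "false"},
    auth=AUTH,
    verify=False,
)
```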
## Issue

When a new TLS certificate authority (CA) certificate is issued, the opensearch-operator should add this new CA to all its units and request new certificates. The new certificates (including the CA certificate) should be distributed to all OpenSearch nodes in a rolling-restart manner, without downtime for the entire cluster.

Due to limitations of the self-signed-certificates operator it is not possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to expire or has expired
- request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on the self-signed-certificates operator. A new root / CA certificate will only be generated and issued if the common_name of the CA changes. We have therefore decided to check each certificate for a new CA; if one is found, we store the new CA and initiate the CA rotation workflow on OpenSearch.

## Solution

This PR implements the following workflow:
- check each `CertificateAvailableEvent` for a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer data
- when the restart is done on all of the cluster nodes, request new TLS certificates and apply them to the node

During the CA renewal phase, all incoming `CertificateAvailableEvents` are deferred in order to avoid incompatibilities in communication between the nodes.

Please also see the flow of events and actions documented here: https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

## Notes

- There is a dependency on #367 because, during the rolling restart when the CA is rotated, it is very likely that the voting exclusion issue shows up (at least in 3-node clusters). Therefore the integration test currently runs with only two nodes. Once the voting exclusions issue is resolved, this can be updated to the usual three nodes.
- Due to an upstream bug with the JDK it is necessary to use TLS v1.2 (for more details see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the jvm options file of OpenSearch (used to set the TLS config to v1.2).

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
Co-authored-by: Judit Novak <[email protected]>
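A condensed sketch of the workflow described in the Solution section above; apart from `CertificateAvailableEvent` and the `tls_ca_renewing` / `tls_ca_renewed` notices named there, the helper names are assumptions rather than the operator's exact API:

```python
# Condensed sketch of the CA-rotation flow described above. Helpers such as
# _ca_rotation_in_progress, _read_stored_ca, _add_ca_to_truststore,
# _set_unit_notice and _restart_opensearch are assumptions, not the operator's
# exact API.
def _on_certificate_available(self, event) -> None:
    if self._ca_rotation_in_progress():
        # While the CA is being rotated, defer incoming certificate events to
        # avoid mixing old and new CAs between nodes.
        event.defer()
        return

    if event.ca != self._read_stored_ca():
        self._add_ca_to_truststore(event.ca)
        self._set_unit_notice("tls_ca_renewing")
        self._restart_opensearch()  # rolling restart, coordinated via the lock
        self._set_unit_notice("tls_ca_renewed")
        return

    # Once every unit has restarted with the new CA, request new certificates
    # and apply them to the node.
    if self._all_units_renewed():
        self._request_new_certificates()
```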
Currently, we have several issues with 2-node clusters, as OpenSearch won't automatically manage voting anymore. This PR adds logic to manually manage the 2-node cluster scenario using the voting exclusions API. Whenever we have only two nodes active and registered to the cluster as voting units, _settle_voting_exclusions will exclude the non-leader unit from voting. It also excludes the leaving unit, as we need to cover the scenario where that unit is the cluster_manager and, when moving from 3->2 units, we may end up with stale metadata.
This PR also makes exclusion management mandatory at start/stop, so we can be sure the voting count is always correct at each stage.
For example, moving from 3->2->1 units results in the following (see the sketch after this list):
3->2) The cluster will set 2x voting exclusions: one for the unit leaving (if this is the cluster manager, that position will move away) and one for one of the 2x remaining units, following a sorted list
2->1) The voting exclusions are removed
Likewise, on scaling up:
1->2) One voting exclusion is added, following the same sorted list of all node names
2->3) Voting exclusions are removed
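A toy sketch of that sorted-list rule (not the charm code), showing the expected exclusion sets at each step:

```python
# Toy sketch of the sorted-list rule described above; not the charm code.
def expected_voting_exclusions(node_names, leaving=None):
    """Return the node names expected to be excluded from voting."""
    voters = sorted(n for n in node_names if n != leaving)
    exclusions = set()
    if leaving:
        exclusions.add(leaving)
    # With exactly two remaining voters, also exclude the first of the sorted
    # list so a single, predictable node keeps the vote.
    if len(voters) == 2:
        exclusions.add(voters[0])
    return exclusions

print(expected_voting_exclusions(["osd-0", "osd-1", "osd-2"], leaving="osd-2"))
# 3 -> 2: two exclusions (the leaving unit plus one of the remaining units).
print(expected_voting_exclusions(["osd-0"]))
# 2 -> 1: no exclusions remain.
print(expected_voting_exclusions(["osd-0", "osd-1"]))
# 1 -> 2: one exclusion, following the same sorted list.
```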
Issues involved
This PR touches #324, #326, #327, and in #325 this behavior is also observed. It is also linked to issues in our ha/test_storage.py, as one can see in this run.
Closes: #324, #326, #327