
[DPE-4575][DPE-4886][DPE-4983] Add voting exclusions management #367

Closed
wants to merge 44 commits

Conversation

phvalguima
Contributor

@phvalguima phvalguima commented Jul 15, 2024

Currently we have several issues with 2-node clusters, as OpenSearch will not automatically manage the voting configuration for them. This PR adds logic to manually manage the 2-node scenario using the voting_exclusions API. Whenever only two nodes are active and registered in the cluster as voting units, `_settle_voting_exclusions` excludes the non-leader unit from voting. It also excludes the leaving unit, to cover the scenario where that unit is the cluster_manager: when moving from 3->2 units we may otherwise end up with stale metadata.

This PR also makes exclusion handling mandatory at start / stop, so we can be sure the voting count is correct at every stage.

For example, moving from 3->2->1 units results in:
3->2) The cluster sets two voting exclusions: one for the leaving unit (if it is the cluster manager, that role moves away) and one for one of the two remaining units, chosen from a sorted list.
2->1) The voting exclusions are removed.

Likewise, on scaling up:
1->2) One voting exclusion is added, following the same sorted list of all node names.
2->3) Voting exclusions are removed.
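
As a rough illustration of the rule above, here is a minimal sketch (a hypothetical helper, not the PR's actual implementation) of how the exclusion set could be derived from a sorted list of CM-eligible node names:

```python
def planned_voting_exclusions(cm_node_names, unit_name, unit_is_stopping):
    """Return node names to exclude from voting; an empty list means 'clear all exclusions'."""
    # Node names that will remain CM-eligible once this hook completes.
    remaining = sorted(n for n in cm_node_names if not (unit_is_stopping and n == unit_name))
    if len(remaining) != 2:
        # 2->1, 2->3 and larger clusters: remove all exclusions and let
        # OpenSearch manage the voting configuration again.
        return []
    if unit_is_stopping:
        # 3->2: exclude the leaving unit plus the first remaining unit
        # (sorted order keeps the choice predictable across hooks).
        return [unit_name, remaining[0]]
    # 1->2: exclude the first unit of the sorted list, leaving a single voter.
    return [remaining[0]]
```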

To Consider

Scenarios we should watch our system under:

  • 2-node cluster
  • Unexpected loss of the elected CM node (e.g. crash or SIGTERM)
  • HA tests (currently run on 3+ node clusters; we should ensure the same behavior remains on 2-node clusters)
  • See the behavior of the last remaining node (previously added to the voting exclusions list)
  • We should ensure that a scale-up / restart removes any voting exclusion already applied to the current node (1- or 2-node clusters)

Acceptance Criteria

  • handle voting exclusion in normal case
  • handle voting exclusion removal (so that on scale-up / down the node can still participate in the election)
  • scaling-up/down/ha Integration tests should pass
  • test-storage should also pass (fixing stale metadata issue)
  • fix of the “shards don’t get assigned” issue (a side effect of the cluster not being able to elect a CM)

Issues involved

This PR touches #324, #326 and #327, and the same behavior is also observed in #325. It is also linked to issues in our ha/test_storage.py, as one can see in this run.

Closes: #324, #326, #327

@phvalguima phvalguima changed the title [DPE-4886] Add integration and unit tests to voting exclusions [DPE-4575][DPE-4886] Add voting exclusions management Jul 16, 2024
Contributor

@reneradoi reneradoi left a comment


Hey Pedro, the PR looks quite good. I've tested especially the scenario with re-attaching storage locally, multiple times, and could not reproduce the issue anymore. That's good!

I just left some questions and comments for details, but in general I'd say this is fine.

@@ -537,8 +543,6 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
Contributor

While before this cleanup was done on every update_status, now it's only done when the Health is green. Is this on purpose?

Contributor Author

That is not the case: it is done in every case except HealthColors.UNKNOWN. Indeed, we defer the event if it is not green. I put it down there because I need the API to be responsive before configuring voting exclusions; if it is not responsive, we will get UNKNOWN and retry later anyway.

I will add some comments to clarify that.
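
For reference, a minimal sketch of the ordering described here (assumed helper names such as `self.health.apply()`; not the charm's actual hook code): the health check runs first, UNKNOWN aborts and retries later, non-green defers the event, and the leader's exclusion cleanup still runs for any responsive state.

```python
def _on_update_status(self, event):
    health = self.health.apply()  # assumed helper returning a HealthColors value
    if health == HealthColors.UNKNOWN:
        # API not responsive yet: nothing useful can be done, retry on a later hook.
        event.defer()
        return
    if health != HealthColors.GREEN:
        event.defer()  # keep retrying until the cluster settles
    # Exclusion cleanup runs in every responsive (non-UNKNOWN) state.
    if self.unit.is_leader():
        self.opensearch_exclusions.cleanup()
```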

lib/charms/opensearch/v0/helper_cluster.py (outdated, resolved)
self.opensearch_exclusions.add_voting(hosts, node_names=[sorted_cm[0]])
# Now, we clean up the sorted_cm list, as we want to be sure the new manager is elected
# and different than the excluded units.
sorted_cm.pop(0)
Contributor

I was wondering why you chose to exclude node [0] from voting and remove it from the list. This results in a new cluster manager node being elected when you scale up from 1 unit to more, and also while the application is being removed. I just saw this locally when testing; it does not create that much latency, but wouldn't it be better to use the last one from the list instead?

Contributor Author

@phvalguima phvalguima Jul 17, 2024


So, @reneradoi, yes, I noticed that as well. But when I was discussing with @Mehdi-Bendriss, we agreed to make this list predictable instead of keeping track of the cluster manager. I could add a check here for that, but then the other check, right before, gets slightly more complicated:

            if unit_is_stopping:
                # Remove both this unit and the first sorted_cm from the voting
                self.opensearch_exclusions.add_voting(
                    hosts, node_names=[self.unit_name, sorted_cm[0]]  ## <<<------ should we also add a check here?
                )

Given this is quite an exception (i.e. going from 1->2 or 2->1), I took the simpler approach.

Contributor Author

I can improve the comments around here, this logic is pretty brittle tbh.

Contributor

I think I overlooked and missed a safety component there: if we remain with 2 units and the one being removed is not the currently elected CM, maybe it makes sense to add the other unit to the voting exclusion.
This should reduce switchovers and the risk they entail.

Contributor Author

@Mehdi-Bendriss I would not recommend that. The main reason is that moving the elected manager between nodes is far faster than Juju hooks. We need to be predictable in this specific case, even if it means moving the elected manager.

@phvalguima phvalguima requested a review from reneradoi July 17, 2024 11:38
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment


Thanks Pedro - I have a few questions. There is also the missing exception handling of settle_voting in various places in the code.

lib/charms/opensearch/v0/helper_cluster.py (outdated, resolved)
@@ -573,6 +585,9 @@ def _on_config_changed(self, event: ConfigChangedEvent): # noqa C901
        restart_requested = False
        if self.opensearch_config.update_host_if_needed():
            restart_requested = True
            # Review voting exclusions as our IP has changed: we may be coming back from a
            # network outage case.
            self._settle_voting_exclusions(unit_is_stopping=False)
Contributor

What will happen here if no other node is online (self.alt_hosts == [])? I believe this will retry for 5 minutes and eventually crash with a RetryError.

Contributor Author

@Mehdi-Bendriss so, self.alt_hosts will not return the local host in the list? How can I get it then?

Contributor Author

Changed the way we call ClusterTopology's elected_manager and nodes. They now do more checks on self.alt_hosts.
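
A hedged sketch of the kind of guard referred to here (the names `alt_hosts` and `unit_ip` are assumptions, not the charm's exact API): pick the hosts to query before calling into ClusterTopology, so an empty peer list cannot spin in retries until a RetryError.

```python
def hosts_for_topology_queries(alt_hosts, unit_ip):
    """Prefer reachable peer hosts; fall back to the local node when none are online."""
    return list(alt_hosts) if alt_hosts else [unit_ip]
```

Callers like elected_manager and nodes would then always receive a non-empty host list, or decide early to retry on a later hook.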

lib/charms/opensearch/v0/opensearch_base_charm.py (outdated, resolved)
@@ -537,17 +548,26 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
Contributor

Is there a reason why the shards allocation exclusion cleanup is postponed until later in the hook? As long as there is connectivity to a host, we should be able to clean up.

Contributor Author

Yes, the health checks below let anything pass unless the cluster is really in a bad state (i.e. UNKNOWN). So I moved the check below these first health checks because it makes more sense.

lib/charms/opensearch/v0/opensearch_base_charm.py (outdated, resolved)
lib/charms/opensearch/v0/opensearch_nodes_exclusions.py (outdated, resolved)
                resp_status_code=True,
                retries=3,
            )
            return True
        except OpenSearchHttpError:
            return False

-    def _delete_voting(self) -> bool:
+    def delete_voting(self, alt_hosts: Optional[List[str]] = None) -> bool:
        """Remove all the voting exclusions - cannot target 1 exclusion at a time."""
        # "wait_for_removal" is VERY important, it removes all voting configs immediately
        # and allows any node to return to the voting config in the future
        try:
            self._opensearch.request(
                "DELETE",
                "/_cluster/voting_config_exclusions?wait_for_removal=false",
Contributor

I do not remember why I set wait_for_removal=false instead of true in the past. I now don't see a reason why it shouldn't be true.

Contributor Author

@Mehdi-Bendriss in my own tests, I could not get this DELETE to work with wait_for_removal=true... I wanted to move it to true as well, btw!

Contributor Author

@Mehdi-Bendriss, found this in the docs:

Defaults to true, meaning that all excluded nodes must be removed from the cluster before this API takes any action.

So, wait_for_removal=true expects the node to be gone entirely... That is why it is not suitable for us.
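
For illustration, a hedged sketch of the two API calls in question, using plain `requests` rather than the charm's own request helper (host, port, node names and credentials are placeholders):

```python
import requests

BASE = "https://localhost:9200/_cluster/voting_config_exclusions"
AUTH = ("admin", "admin")  # placeholder credentials

# Exclude specific nodes from voting; they stay in the cluster but cannot
# take part in cluster-manager elections.
requests.post(f"{BASE}?node_names=opensearch-0,opensearch-1", auth=AUTH, verify=False)

# Clear all exclusions at once. wait_for_removal=false clears them immediately;
# wait_for_removal=true would wait for the excluded nodes to leave the cluster
# entirely, which is why it does not fit this use case.
requests.delete(f"{BASE}?wait_for_removal=false", auth=AUTH, verify=False)
```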

tests/integration/ha/continuous_writes.py (outdated, resolved)
lib/charms/opensearch/v0/opensearch_base_charm.py (outdated, resolved)
@juditnovak juditnovak linked an issue Jul 25, 2024 that may be closed by this pull request
@phvalguima phvalguima changed the title [DPE-4575][DPE-4886] Add voting exclusions management [DPE-4575][DPE-4886][DPE-4983] Add voting exclusions management Aug 8, 2024
@phvalguima phvalguima closed this Aug 19, 2024
reneradoi added a commit that referenced this pull request Sep 6, 2024
## Issue
When a new TLS certificate authority (CA) certificate is issued, the
opensearch-operator should add this new CA to all its units and request
new certificates. The new certificates (including the CA certificate)
should be distributed to all OpenSearch nodes in a rolling restart
manner, without downtime to the entire cluster.

Due to limitations on the self-signed-certificates operator it is not
possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to or has expired
- request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on
the self-signed-certificates operator. A new root / CA certificate will
only be generated and issued if the common_name of the CA changes.

We have decided to implement the logic in such a way that we check each
certificate for whether it includes a new CA. If so, we store the new CA and
initiate the CA rotation workflow on OpenSearch.

## Solution

This PR implements the following workflow:
- check whether each `CertificateAvailableEvent` includes a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to
coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer
data
- when the restart is done on all of the cluster nodes, request new TLS
certificates and apply them to the node

During the phase of renewing the CA, all incoming
`CertificateAvailableEvents` will be deferred in order to avoid
incompatibilities in communication between the nodes.

Please also see the flow of events and actions that has been documented
here:
https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow
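
A hedged sketch of the detection step in the workflow above (helper names such as `store_new_ca`, `peer_unit_data`, `request_restart` and `apply_new_certificates` are assumptions, not the operator's actual API):

```python
def _on_certificate_available(self, event):
    if event.ca != self.stored_ca:                      # the certificate carries a new CA
        self.store_new_ca(event.ca)                     # add it to the truststore
        self.peer_unit_data.update({"tls_ca_renewing": "True"})
        self.request_restart()                          # rolling restart via the locking mechanism
        return
    if self.peer_unit_data.get("tls_ca_renewing") and not self.peer_unit_data.get("tls_ca_renewed"):
        # CA rotation still in progress across the cluster: defer to avoid
        # incompatibilities between nodes with mixed trust stores.
        event.defer()
        return
    self.apply_new_certificates(event)                  # normal path: apply the new certificates
```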

## Notes
- There is a dependency on #367, because during the rolling restart when the CA is rotated it is very likely that the voting exclusion issue shows up (at least in 3-node clusters). Therefore the integration test currently runs with only two nodes. Once the voting exclusions issue is resolved, this can be updated to the usual three nodes.
- Due to an upstream bug with JDK it is necessary to use TLS v1.2 (more
details see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the jvm options
file of OpenSearch (used to set TLS config to v1.2).
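
As a sketch of that last note (the file path and flag are assumptions; the PR adds its own helper for this), appending the TLS v1.2 restriction to the jvm.options file could look like:

```python
from pathlib import Path

JVM_OPTIONS = Path("/usr/share/opensearch/config/jvm.options")  # assumed location
TLS_OPTION = "-Djdk.tls.client.protocols=TLSv1.2"

def ensure_tls_v12() -> None:
    """Append the TLS v1.2 JVM flag once, leaving existing options untouched."""
    current = JVM_OPTIONS.read_text()
    if TLS_OPTION not in current:
        JVM_OPTIONS.write_text(current.rstrip("\n") + "\n" + TLS_OPTION + "\n")
```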

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
Co-authored-by: Judit Novak <[email protected]>
skourta pushed a commit that referenced this pull request Sep 18, 2024
skourta pushed a commit that referenced this pull request Sep 18, 2024
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Stuck in waiting One or more replica shards...
3 participants