[DPE-4575][DPE-4886][DPE-4983] Add voting exclusions management #367
Conversation
Hey Pedro, the PR looks quite good. I've tested esp. the scenario with re-attaching storage locally multiple times and could not reproduce the issue anymore. That's good!
I just left some questions and comments for details, but in general I'd say this is fine.
@@ -537,8 +543,6 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
While before this cleanup was done on every update_status, now it's only done when the health is green. Is this on purpose?
That is not the case... It is done in every case except HealthColors.UNKNOWN. Indeed, we defer the event if it is not green. I put it down there because I need the API to be responsive before configuring voting exclusions; if it is not responsive, we will get UNKNOWN anyway and retry later.
I will add some comments to clarify that.
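For context, a minimal sketch of the ordering being described (the helper call `self.health.apply()` and the attribute names are assumptions, not necessarily the charm's exact code): the health check runs first, and the exclusions cleanup only happens once the API has answered with something other than UNKNOWN.

```python
# Minimal sketch of the ordering discussed above -- not the charm's exact code.
# HealthColors comes from the charm's opensearch lib; self.health.apply() is an
# assumed helper name.
def _on_update_status(self, event):
    health = self.health.apply()

    if health == HealthColors.UNKNOWN:
        # API is not responsive: do not touch exclusions, update-status will
        # fire again later anyway.
        return

    # The API answered, so exclusions are handled for every color except UNKNOWN.
    if self.unit.is_leader():
        self.opensearch_exclusions.cleanup()

    if health != HealthColors.GREEN:
        # Cluster not settled yet: defer and re-check on the next run.
        event.defer()
        return
```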
        self.opensearch_exclusions.add_voting(hosts, node_names=[sorted_cm[0]])
        # Now, we clean up the sorted_cm list, as we want to be sure the new manager is elected
        # and different than the excluded units.
        sorted_cm.pop(0)
I was wondering why you chose to exclude node [0] from voting and remove it from the list? This will result in a new cluster manager node being elected when you scale up from 1 unit to more, and in the process of removing the application. I just saw this locally when testing; not that it creates that much latency, but wouldn't it be better to use the last one from the list instead?
So, @reneradoi, yes, I noticed that as well. But when I discussed this with @Mehdi-Bendriss, we agreed to make this list predictable instead of keeping track of the cluster manager. I could add a check here for that, but then the other check, right before, gets slightly more complicated:
if unit_is_stopping:
    # Remove both this unit and the first sorted_cm from the voting
    self.opensearch_exclusions.add_voting(
        hosts, node_names=[self.unit_name, sorted_cm[0]]  ## <<<------ should we also add a check here?
    )
Given this is quite an exception (i.e. going from 1->2 or 2->1), I took the simpler approach.
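For reference, the extra guard being discussed would look roughly like this (a sketch only, not the code in this PR):

```python
# Sketch of the extra check discussed above: skip the stopping unit itself when
# picking the second exclusion from the sorted list. Not the code in this PR.
if unit_is_stopping:
    candidates = [name for name in sorted_cm if name != self.unit_name]
    node_names = [self.unit_name]
    if candidates:
        node_names.append(candidates[0])
    self.opensearch_exclusions.add_voting(hosts, node_names=node_names)
```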
I can improve the comments around here, this logic is pretty brittle tbh.
I think I overlooked and missed a safety component there: if we remain with 2 units and the one being removed is not the current elected CM, maybe it makes sense to add the other unit to the voting exclusions.
This should reduce switchovers and the risk they entail.
@Mehdi-Bendriss I would not recommend that. The main reason is that I noticed moving the elected manager between nodes is far faster than Juju hooks. We need to be predictable in this specific case, even if it means moving the elected manager.
Thanks Pedro - I have a few questions. There is also the aspect of the missing exception handling of settle_voting in various places in the code.
@@ -573,6 +585,9 @@ def _on_config_changed(self, event: ConfigChangedEvent):  # noqa C901
        restart_requested = False
        if self.opensearch_config.update_host_if_needed():
            restart_requested = True
            # Review voting exclusions as our IP has changed: we may be coming back from a network
            # outage case.
            self._settle_voting_exclusions(unit_is_stopping=False)
What will happen here if no other node is online (self.alt_hosts == [])? I believe this will retry for 5 minutes and eventually crash with a RetryError.
@Mehdi-Bendriss so, self.alt_hosts will not return the local host in the list? How can I get it then?
Changed the way we call both ClusterTopology's elected_manager and nodes. They will now do more checks on self.alt_hosts.
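To illustrate the kind of extra check on `self.alt_hosts` mentioned here, a sketch (attribute names such as `self.unit_ip` are assumptions, not the charm's exact API):

```python
# Sketch of an alt_hosts guard as described above; attribute and method names
# are assumptions, not the exact charm code.
def _hosts_for_topology_queries(self) -> list:
    hosts = list(self.alt_hosts or [])
    if not hosts and self.unit_ip:
        # No peer is reachable (single unit or network outage): fall back to
        # the local node instead of retrying for 5 minutes and raising
        # RetryError.
        hosts = [self.unit_ip]
    return hosts
```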
@@ -537,17 +548,26 @@ def _on_update_status(self, event: UpdateStatusEvent):

        # if there are exclusions to be removed
        if self.unit.is_leader():
            self.opensearch_exclusions.cleanup()
Is there a reason why the shards allocation exclusion cleanup is postponed until later in the hook? As long as there is connectivity to a host, we should be able to clean up.
Yes, the health checks below let anything pass unless the cluster is really in a bad state (i.e. UNKNOWN). So I moved the cleanup below these first health checks, because it makes more sense there.
                resp_status_code=True,
                retries=3,
            )
            return True
        except OpenSearchHttpError:
            return False

-   def _delete_voting(self) -> bool:
+   def delete_voting(self, alt_hosts: Optional[List[str]] = None) -> bool:
        """Remove all the voting exclusions - cannot target 1 exclusion at a time."""
        # "wait_for_removal" is VERY important, it removes all voting configs immediately
        # and allows any node to return to the voting config in the future
        try:
            self._opensearch.request(
                "DELETE",
                "/_cluster/voting_config_exclusions?wait_for_removal=false",
I do not remember why I set wait_for_removal=false instead of true in the past. I now don't see a reason why it shouldn't be true.
@Mehdi-Bendriss in my own tests, I could not get this DELETE to work with wait_for_removal=true... I wanted to move it to true as well, btw!
@Mehdi-Bendriss, found this in the docs:
Defaults to true, meaning that all excluded nodes must be removed from the cluster before this API takes any action.
So, wait_for_removal=true expects the node to be gone entirely... That is why it is not suitable for us.
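For reference, the two cluster-level calls under discussion, shown here with plain `requests` instead of the charm's `self._opensearch.request` wrapper (host, credentials and node names are placeholders):

```python
# Illustration of the voting-exclusions API calls discussed above, using
# `requests` directly; host, credentials and node names are placeholders.
import requests

BASE = "https://localhost:9200"
AUTH = ("admin", "admin")  # placeholder credentials

# Add a node to the voting exclusions list.
requests.post(
    f"{BASE}/_cluster/voting_config_exclusions",
    params={"node_names": "opensearch-1", "timeout": "1m"},
    auth=AUTH,
    verify=False,
)

# Clear all exclusions. With wait_for_removal=false the entries are dropped
# immediately; with the default (true) the API waits until the excluded nodes
# have actually left the cluster, which is why it does not fit this use case.
requests.delete(
    f"{BASE}/_cluster/voting_config_exclusions",
    params={"wait_for_removal": "false"},
    auth=AUTH,
    verify=False,
)
```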
## Issue

When a new TLS certificate authority (CA) certificate is issued, the opensearch-operator should add this new CA to all its units and request new certificates. The new certificates (including the CA certificate) should be distributed to all OpenSearch nodes in a rolling-restart manner, without downtime for the entire cluster.

Due to limitations of the self-signed-certificates operator it is not possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to expire or has expired
- request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on the self-signed-certificates operator. A new root / CA certificate will only be generated and issued if the common_name of the CA changes. We have therefore decided to check each certificate for a new CA; if one is found, we store the new CA and initiate the CA rotation workflow on OpenSearch.

## Solution

This PR implements the following workflow:
- check each `CertificateAvailableEvent` for a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer data
- when the restart is done on all of the cluster nodes, request new TLS certificates and apply them to the node

During the CA renewal phase, all incoming `CertificateAvailableEvents` are deferred in order to avoid incompatibilities in communication between the nodes.

Please also see the flow of events and actions documented here: https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

## Notes

- There is a dependency on #367 because, during the rolling restart when the CA is rotated, it is very likely that the voting exclusion issue shows up (at least in 3-node clusters). Therefore the integration test currently runs with only two nodes. Once the voting exclusions issue is resolved, this can be updated to the usual three nodes.
- Due to an upstream bug with the JDK it is necessary to use TLS v1.2 (for more details see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the jvm options file of OpenSearch (used to set the TLS config to v1.2).

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
Co-authored-by: Judit Novak <[email protected]>
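A condensed sketch of the workflow described in the Solution section above; apart from `CertificateAvailableEvent` and the `tls_ca_renewing` / `tls_ca_renewed` notices named there, the helper names are assumptions rather than the operator's exact API:

```python
# Condensed sketch of the CA-rotation flow described above. Helpers such as
# _ca_rotation_in_progress, _read_stored_ca, _add_ca_to_truststore,
# _set_unit_notice and _restart_opensearch are assumptions, not the operator's
# exact API.
def _on_certificate_available(self, event) -> None:
    if self._ca_rotation_in_progress():
        # While the CA is being rotated, defer incoming certificate events to
        # avoid mixing old and new CAs between nodes.
        event.defer()
        return

    if event.ca != self._read_stored_ca():
        self._add_ca_to_truststore(event.ca)
        self._set_unit_notice("tls_ca_renewing")
        self._restart_opensearch()  # rolling restart, coordinated via the lock
        self._set_unit_notice("tls_ca_renewed")
        return

    # Once every unit has restarted with the new CA, request new certificates
    # and apply them to the node.
    if self._all_units_renewed():
        self._request_new_certificates()
```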
Currently, we have several issues with 2-node clusters, as OpenSearch won't automatically manage voting anymore. This PR adds logic to manually manage the 2-node cluster scenario using the voting exclusions API. Whenever we have only two nodes active and registered to the cluster as voting units, _settle_voting_exclusions will exclude the non-leader unit from voting. It also excludes the leaving unit, as we need to cover the scenario where that unit is the cluster_manager and, when moving from 3->2 units, we may end up with stale metadata.
This PR also makes exclusion management mandatory at start/stop, so we can be sure the voting count is always correct at each stage.
For example, moving from 3->2->1 units results in the following (see the sketch after this list):
3->2) The cluster will set 2x voting exclusions: one for the unit leaving (if this is the cluster manager, that position will move away) and one for one of the 2x remaining units, following a sorted list
2->1) The voting exclusions are removed
Likewise, on scaling up:
1->2) One voting exclusion is added, following the same sorted list of all node names
2->3) Voting exclusions are removed
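A toy sketch of that sorted-list rule (not the charm code), showing the expected exclusion sets at each step:

```python
# Toy sketch of the sorted-list rule described above; not the charm code.
def expected_voting_exclusions(node_names, leaving=None):
    """Return the node names expected to be excluded from voting."""
    voters = sorted(n for n in node_names if n != leaving)
    exclusions = set()
    if leaving:
        exclusions.add(leaving)
    # With exactly two remaining voters, also exclude the first of the sorted
    # list so a single, predictable node keeps the vote.
    if len(voters) == 2:
        exclusions.add(voters[0])
    return exclusions

print(expected_voting_exclusions(["osd-0", "osd-1", "osd-2"], leaving="osd-2"))
# 3 -> 2: two exclusions (the leaving unit plus one of the remaining units).
print(expected_voting_exclusions(["osd-0"]))
# 2 -> 1: no exclusions remain.
print(expected_voting_exclusions(["osd-0", "osd-1"]))
# 1 -> 2: one exclusion, following the same sorted list.
```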
Issues involved
This PR touches #324, #326, #327, and in #325 this behavior is also observed. It is also linked to issues in our ha/test_storage.py, as one can see in this run.
Closes: #324, #326, #327