[DPE-4196] Rebase the plugin refactor #520
Merged
phvalguima merged 32 commits into DPE-4196-improve-plugin-manager from rebase-2024.12.07-DPE-4196-improve-plugin-manager on Dec 13, 2024
Conversation
This PR adds some extra logging so we can follow up on any failures due to index creation.
Sync charm docs from https://discourse.charmhub.io Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
Updating libs to latest versions.
Update the `replace()` method to (1) be more "greedy" when searching for matches in the text, including multi-line matches; and (2) fix the write-back to the file. Currently, if the new content is smaller than the original file, we can end up writing `<new content><left over bytes>`, and the file keeps its original size. This PR simplifies the logic that decides how to write the changed content. Closes #483
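A minimal sketch of the write-back fix, assuming a hypothetical file-level `replace()` helper; the names and flags below are illustrative, not the charm's actual implementation:

```python
import re
from pathlib import Path


def replace(path: str, pattern: str, replacement: str) -> None:
    """Substitute every (possibly multi-line) match of `pattern` in the file at `path`."""
    file = Path(path)
    original = file.read_text()
    # re.DOTALL lets '.' span newlines, so multi-line matches are found ("greedy" search).
    updated = re.sub(pattern, replacement, original, flags=re.DOTALL | re.MULTILINE)
    if updated != original:
        # write_text opens the file in "w" mode, which truncates it first: a shorter
        # result can never leave <left over bytes> from the old content at the end.
        file.write_text(updated)
```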
Moves the CI run for `test_storage.py` to use `wait_until`. Closes #476
Add `cos-agent` to the wait list. Closes #475 --------- Co-authored-by: Mehdi Bendriss <[email protected]>
## Issue

We were not updating the `chain.pem` file used by the requests module to send requests, which caused communication cuts in a lot of cases. Also fixes #448

## Solution

Add the new chain to `chain.pem` when a CA rotation is initiated, and only delete the old one once all the nodes are done updating their CA **and** certs.

## Summary of changes

This pull request introduces several changes to improve the handling of CA (Certificate Authority) rotation in the OpenSearch charm. The key changes include enhancements to the CA rotation process, ensuring proper cleanup of old CA certificates, and updating the request CA bundle. Additionally, a new method has been added to handle the completion of CA rotation.

### Improvements to CA Rotation Process:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652): Added checks to ensure CA rotation is complete in the cluster before removing old CAs and updating the request bundle (`on_tls_conf_set`, `post_start_init`, `_on_update_status`). [[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652) [[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L821-R839) [[3]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R1180-R1188)
* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911): Introduced new methods `ca_and_certs_rotation_complete_in_cluster` and `on_ca_certs_rotation_complete` to handle the completion of CA rotations and update the request bundle accordingly. [[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911) [[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L887-R957)

### Version Updates:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L121-R121): Incremented `LIBPATCH` version from 2 to 3.
* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L64-R65): Incremented `LIBPATCH` version from 1 to 2.

### Unit Test Enhancements:

* [`tests/unit/lib/test_opensearch_base_charm.py`](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373): Updated unit tests to mock new methods and ensure they are called correctly during the CA rotation process. [[1]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373) [[2]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34R383-R388)

### Code Cleanup:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957): Removed redundant calls to `update_request_ca_bundle` and `remove_old_ca` in `_start_opensearch` and `_post_start_init`. [[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957) [[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L1164-L1167)

### Dependency Updates:

* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R24): Added import for `Path` from `pathlib`.
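As a rough illustration of the intended bundle handling (not the charm's actual methods), a hedged sketch with a placeholder path and illustrative names:

```python
# Hypothetical sketch of the chain.pem handling described above; names are illustrative.
REQUEST_BUNDLE = "/tmp/chain.pem"  # placeholder path for the requests CA bundle


def on_ca_rotation_started(new_ca_pem: str) -> None:
    """Append the new CA so requests trust both the old and the new CA during rotation."""
    with open(REQUEST_BUNDLE, "a") as bundle:
        bundle.write("\n" + new_ca_pem)


def on_ca_certs_rotation_complete(old_ca_pem: str, units_done: list[bool]) -> None:
    """Drop the old CA only once every unit has finished renewing its CA *and* certs."""
    if not all(units_done):
        return  # some node is still mid-rotation: keep both CAs in the bundle
    with open(REQUEST_BUNDLE) as bundle:
        content = bundle.read()
    with open(REQUEST_BUNDLE, "w") as bundle:
        bundle.write(content.replace(old_ca_pem, ""))
```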
This PR extends `wait_until` to also check the units' statuses of a given subordinate charm, based on its principal charm.
…cluster (#486)

## Issue

Fixes #482. If a unit finishes updating its CA before any of the other units start the rotation, this unit will update its certificates, and that will cause communication issues with the rest of the nodes.

## Solution

Check whether `tls_ca_renewing` is set on any of the units; if so, consider the rotation started. This pull request focuses on enhancing the logic for checking the status of CA (Certificate Authority) rotation in the `opensearch_tls.py` module. The changes aim to improve the accuracy and clarity of determining whether the CA rotation process is complete across all units in the cluster.

### Enhancements to CA Rotation Logic:

* **Updated Condition Checks:** Replaced `is_ca_rotation_ongoing` with `ca_rotation_complete_in_cluster` to ensure the correct status is checked before deferring events and updating TLS resources. (`lib/charms/opensearch/v0/opensearch_tls.py`) [[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L213-R213) [[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L640-R640)
* **New Method Implementation:** Implemented `ca_rotation_complete_in_cluster` to accurately determine whether CA rotation is complete by checking flags across all units. This method now tracks both ongoing and completed states more effectively. (`lib/charms/opensearch/v0/opensearch_tls.py`)
* **Removed Redundant Method:** Removed the `is_ca_rotation_ongoing` method, as its functionality is now covered by the enhanced `ca_rotation_complete_in_cluster` method. (`lib/charms/opensearch/v0/opensearch_tls.py`)
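A simplified sketch of such a cluster-wide check, assuming per-unit peer-data flags `tls_ca_renewing` and `tls_ca_renewed` (the second flag name is an assumption here):

```python
# Illustrative sketch, not the method in lib/charms/opensearch/v0/opensearch_tls.py.
def ca_rotation_complete_in_cluster(all_units_peer_data: list[dict]) -> bool:
    """Return True only if no CA rotation is ongoing anywhere in the cluster."""
    started = [bool(d.get("tls_ca_renewing")) for d in all_units_peer_data]
    finished = [bool(d.get("tls_ca_renewed")) for d in all_units_peer_data]
    if not any(started):
        # No unit has flagged a rotation: nothing blocks certificate updates.
        return True
    # As soon as any unit flags the rotation, it only counts as complete once every
    # unit has finished, so a fast unit does not swap its certificates while slower
    # units still trust only the old CA.
    return all(finished)
```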
## Issue

In large deployments with cluster-manager-only nodes and data-only nodes, a deadlock might appear if the whole data application has been powered down and needs to restart. In this case, starting the data nodes should be possible without acquiring the lock first, because the lock index is not available.

## Solution

Start the leader of the data nodes without acquiring the lock if it was previously started but is not running anymore. The PR also adds unit test coverage for the case where a cluster-manager-only node has not initialized the security index and a failover orchestrator with a data role joins and wants to start (see [here](#424 (comment)) for context).
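A condensed sketch of that decision, with hypothetical attribute names standing in for the charm's actual state checks:

```python
# Illustrative sketch; the attribute names below are assumptions, not the charm's API.
def may_start_without_lock(unit) -> bool:
    """Whether the data-node leader may restart without the usual restart lock.

    When the whole data application is down, the lock index lives on the stopped
    data nodes, so acquiring the lock is impossible and waiting for it would
    deadlock the restart.
    """
    return (
        unit.is_leader
        and "data" in unit.roles
        and unit.was_started_before      # the node ran before the outage...
        and not unit.service_is_running  # ...but is not running anymore
    )
```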
## Issue

Currently the integration test for re-using storage fails occasionally. This is because in one of the tests we remove the application, keeping the storage disks and re-attaching them to a new application. This can lead to stale metadata, as described in [our docs](https://charmhub.io/opensearch/docs/h-attached-storage), which will cause the OpenSearch service to fail on startup.

## Solution

Adjust the integration test workflow to first scale down to one remaining unit before removing the application. This will cause the remaining unit to become the leader, if it wasn't already. Removing the application now, and later re-attaching this unit's storage disk to the new leader, means that we can start up OpenSearch correctly.
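Roughly, the adjusted flow could look like the following pytest-operator excerpt; the application and unit names are illustrative, and the re-deploy and storage re-attachment steps are elided:

```python
# Hypothetical excerpt of the adjusted integration test, not the repository's code.
async def test_storage_reuse(ops_test):
    app = ops_test.model.applications["opensearch"]

    # Scale down to a single unit first: the surviving unit becomes (or already is)
    # the leader, so its disk is the one later re-attached to the new leader.
    await app.destroy_units("opensearch/1", "opensearch/2")
    await ops_test.model.wait_for_idle(apps=["opensearch"], status="active", timeout=1800)

    # Only now remove the application; the storage disk is kept for re-attachment
    # (re-deploying and attaching the disk to the new application is elided here).
    await app.destroy()
```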
This PR extends the current charm to support the following profiles:

* testing:
  * focused on the integration tests and our CI -> 1G RAM dedicated to the heap and no automation
* staging:
  * HA capabilities must be available: for that, we will enforce an index template that encompasses all the indices and sets replicas to 1-all
  * Extends heap to: `max(1G, 10% of RAM)`
  * `indices.memory.index_buffer_size` extends to `25%`
  * Adds three component templates, which will be described later
* production:
  * Same features as staging, but heap is set instead to: `max(1G, 50% of RAM)`

The options above are set based on the following documents:

* https://opensearch.org/docs/latest/tuning-your-cluster/performance/
* https://opensearch.org/docs/latest/search-plugins/knn/performance-tuning/

The user can switch between the three options above, and depending on the selected value, the templates are created or destroyed.
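A small sketch of the heap sizing rule above, assuming total system RAM is known in megabytes (the helper name is illustrative, not the charm's code):

```python
# Illustrative sketch of the profile-based heap sizing described above.
def heap_size_mb(profile: str, total_ram_mb: int) -> int:
    """Return the JVM heap size in MB for a given performance profile."""
    if profile == "testing":
        return 1024                                 # fixed 1G, no further automation
    if profile == "staging":
        return max(1024, int(total_ram_mb * 0.10))  # max(1G, 10% of RAM)
    if profile == "production":
        return max(1024, int(total_ram_mb * 0.50))  # max(1G, 50% of RAM)
    raise ValueError(f"unknown profile: {profile}")


# For example, on a 32G machine: staging -> ~3.2G of heap, production -> 16G of heap.
```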
There is a potential race condition within `opensearch_relation_provider.py`, where we may ask for a user to be set (and hence, eventually call `update_certs`) before the certificate is ready. This happens more often when we deploy a complete bundle at once. This PR fixes the issue by allowing `update_certs` to detect the missing `chain` config in the TLS secrets and raise a `KeyError`. Closes: #493
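A minimal sketch of that guard, assuming the TLS secrets are exposed as a dict-like object and the chain is a list of PEM strings (the path and helper name are illustrative):

```python
from pathlib import Path

REQUEST_BUNDLE = Path("/tmp/chain.pem")  # placeholder path for the requests CA bundle


def update_certs(tls_secrets: dict) -> None:
    """Hypothetical sketch: refresh the requests CA bundle from the TLS secrets."""
    if "chain" not in tls_secrets:
        # Certificate not issued yet (common when a whole bundle is deployed at once):
        # surface it as KeyError so the caller can back off and retry later.
        raise KeyError("chain")
    REQUEST_BUNDLE.write_text("\n".join(tls_secrets["chain"]))
```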
## Issue

If the OpenSearch operator receives a config change before the deployment description has been initialized, applying the performance profile will fail and the charm will be in a failed state. This happens repeatedly on the integration test runs for OpenSearch in the data-integrator CI, e.g. [here](https://github.com/canonical/data-integrator/actions/runs/11585189263/job/32253824740).

## Solution

If the deployment description does not exist yet, the performance profile should not be applied and the received event should be deferred.
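A minimal sketch of that guard in a config-changed handler, using the ops framework's standard `event.defer()`; the deployment-description accessor and profile helper are named illustratively:

```python
# Hypothetical handler excerpt; only event.defer() is the framework's real API here.
def on_config_changed(charm, event) -> None:
    if charm.opensearch_peer_cm.deployment_desc() is None:
        # Applying the performance profile before the deployment description exists
        # would leave the charm in a failed state, so retry once it is initialized.
        event.defer()
        return
    charm.apply_performance_profile()  # illustrative name for the profile logic
```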
Our CIs are failing with: https://github.com/canonical/opensearch-operator/actions/runs/11675736324/job/32510812135 This PR updates `actions/checkout` to v4, in line with other charms.
Sync charm docs from https://discourse.charmhub.io Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
## Issue

Some alert rules have duplicated names, which can be a bit confusing.

## Solution

This PR gives them distinct names so that each name matches its alert rule expression more clearly, and adds spacing between alerts to improve readability.
This PR removes one of the filters that was causing a race condition during CA rotation in large deployments. It fixes the CA rotation for large deployments in our CI. Closes #500
This PR extends our current charm by adding a Terraform module for the charm. It also adds a GitHub workflow to test it, plus a basic test setup.
If a scrape fails, this might indicate that a unit is not in a healthy state. OpenSearch currently does not have a metric saying that a node is down. For example, if the systemd service is stopped on one node, the cluster (N nodes) will drop the faulty node because of connectivity issues, and the metrics will show that the cluster now has N-1 nodes without saying that one node has failed. With this new alert, at least a notification will appear if a node stops being responsive.

How to test:
- Deploy opensearch units
- Stop the opensearch daemon on one of the units

The grafana-agent injects the juju topology into the alert rule, so the expression `up < 1` will filter just for OpenSearch apps:

![image](https://github.com/user-attachments/assets/d09b22be-a571-4ec2-b76d-a654186df327)

The alert will trigger:

![image](https://github.com/user-attachments/assets/a1ace958-1188-4b2b-840c-febfd639dd57)
OpenSearch does not have a thread pool named `bulk`. However, there is a pool named `write` that is used both for bulk requests and for single-document writes. This PR adds the necessary logic to use the OpenSearch data coming from the exporter and, because the expression is big, splits it into smaller expressions using recording rules (the `record` field). To see the original rule that inspired this alert, see this [repo](https://github.com/lukas-vlcek/prometheus-elasticsearch-rules/blob/master/logging_elasticsearch.rules.yaml). Fix: #503
When a cluster is in yellow state, it is because some shards are not active. This can happen under heavy load; however, if there are unassigned shards and the cluster is yellow, that means some replicas are not allocated and it might be necessary to add new nodes in order to host all shards. --------- Co-authored-by: Mehdi Bendriss <[email protected]>
phvalguima merged commit 561fa89 into DPE-4196-improve-plugin-manager on Dec 13, 2024, with 39 of 41 checks passed.
phvalguima deleted the rebase-2024.12.07-DPE-4196-improve-plugin-manager branch on December 13, 2024 at 15:49.
There were several changes to the config-changed logic, which made this rebase rather complex. I am making a PR to the main refactor branch so we can look at it more carefully before having it all together.