[DPE-4196] Rebase the plugin refactor #520
Merged
phvalguima merged 32 commits into DPE-4196-improve-plugin-manager from rebase-2024.12.07-DPE-4196-improve-plugin-manager on Dec 13, 2024
Conversation
This PR adds some extra logging so we can follow up on any failures due to index creation.
Sync charm docs from https://discourse.charmhub.io Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
Updating libs to latest versions.
Update the `replace()` method to (1) be more "greedy" when searching for matches in the text, including multi-line matches; and (2) fix the write-back to the file. Currently, if the new content is smaller than the original file, we can end up writing `<new content><left over bytes>`, and the file keeps its original size. This PR simplifies the logic that decides how to write the changed content. Closes #483
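A minimal sketch of the write-back fix, assuming a hypothetical file-level `replace()` helper; the names and flags below are illustrative, not the charm's actual implementation:

```python
import re
from pathlib import Path


def replace(path: str, pattern: str, replacement: str) -> None:
    """Substitute every (possibly multi-line) match of `pattern` in the file at `path`."""
    file = Path(path)
    original = file.read_text()
    # re.DOTALL lets '.' span newlines, so multi-line matches are found ("greedy" search).
    updated = re.sub(pattern, replacement, original, flags=re.DOTALL | re.MULTILINE)
    if updated != original:
        # write_text opens the file in "w" mode, which truncates it first: a shorter
        # result can never leave <left over bytes> from the old content at the end.
        file.write_text(updated)
```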
Moves the CI run for `test_storage.py` to use `wait_until`. Closes #476
Add `cos-agent` to the wait list. Closes #475 --------- Co-authored-by: Mehdi Bendriss <[email protected]>
## Issue

We were not updating the `chain.pem` file used by the requests module to send requests, which caused communication cuts in a lot of cases. Also fixes #448

## Solution

Add the new chain to `chain.pem` when a CA rotation is initiated, and only delete the old one once all the nodes are done updating their CA **and** certs.

## Summary of changes

This pull request introduces several changes to improve the handling of CA (Certificate Authority) rotation in the OpenSearch charm. The key changes include enhancements to the CA rotation process, ensuring proper cleanup of old CA certificates, and updating the request CA bundle. Additionally, a new method has been added to handle the completion of CA rotation.

### Improvements to CA Rotation Process:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652): Added checks to ensure CA rotation is complete in the cluster before removing old CAs and updating the request bundle (`on_tls_conf_set`, `post_start_init`, `_on_update_status`). [[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652) [[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L821-R839) [[3]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R1180-R1188)
* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911): Introduced new methods `ca_and_certs_rotation_complete_in_cluster` and `on_ca_certs_rotation_complete` to handle the completion of CA rotations and update the request bundle accordingly. [[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911) [[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L887-R957)

### Version Updates:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L121-R121): Incremented `LIBPATCH` version from 2 to 3.
* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L64-R65): Incremented `LIBPATCH` version from 1 to 2.

### Unit Test Enhancements:

* [`tests/unit/lib/test_opensearch_base_charm.py`](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373): Updated unit tests to mock new methods and ensure they are called correctly during the CA rotation process. [[1]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373) [[2]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34R383-R388)

### Code Cleanup:

* [`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957): Removed redundant calls to `update_request_ca_bundle` and `remove_old_ca` in `_start_opensearch` and `_post_start_init`. [[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957) [[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L1164-L1167)

### Dependency Updates:

* [`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R24): Added import for `Path` from `pathlib`.
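As a rough illustration of the intended bundle handling (not the charm's actual methods), a hedged sketch with a placeholder path and illustrative names:

```python
# Hypothetical sketch of the chain.pem handling described above; names are illustrative.
REQUEST_BUNDLE = "/tmp/chain.pem"  # placeholder path for the requests CA bundle


def on_ca_rotation_started(new_ca_pem: str) -> None:
    """Append the new CA so requests trust both the old and the new CA during rotation."""
    with open(REQUEST_BUNDLE, "a") as bundle:
        bundle.write("\n" + new_ca_pem)


def on_ca_certs_rotation_complete(old_ca_pem: str, units_done: list[bool]) -> None:
    """Drop the old CA only once every unit has finished renewing its CA *and* certs."""
    if not all(units_done):
        return  # some node is still mid-rotation: keep both CAs in the bundle
    with open(REQUEST_BUNDLE) as bundle:
        content = bundle.read()
    with open(REQUEST_BUNDLE, "w") as bundle:
        bundle.write(content.replace(old_ca_pem, ""))
```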
This PR extends `wait_until` to also check the units' statuses of a given subordinate charm, based on its principal charm.
…cluster (#486)

## Issue

Fixes #482. If a unit finishes updating its CA before any of the other units start the rotation, this unit will update its certificates, and that will cause communication issues with the rest of the nodes.

## Solution

Check whether `tls_ca_renewing` is set on any of the units; if so, consider the rotation started. This pull request focuses on enhancing the logic for checking the status of CA (Certificate Authority) rotation in the `opensearch_tls.py` module. The changes aim to improve the accuracy and clarity of determining whether the CA rotation process is complete across all units in the cluster.

### Enhancements to CA Rotation Logic:

* **Updated Condition Checks:** Replaced `is_ca_rotation_ongoing` with `ca_rotation_complete_in_cluster` to ensure the correct status is checked before deferring events and updating TLS resources. (`lib/charms/opensearch/v0/opensearch_tls.py`) [[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L213-R213) [[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L640-R640)
* **New Method Implementation:** Implemented `ca_rotation_complete_in_cluster` to accurately determine whether CA rotation is complete by checking flags across all units. This method now tracks both ongoing and completed states more effectively. (`lib/charms/opensearch/v0/opensearch_tls.py`)
* **Removed Redundant Method:** Removed the `is_ca_rotation_ongoing` method, as its functionality is now covered by the enhanced `ca_rotation_complete_in_cluster` method. (`lib/charms/opensearch/v0/opensearch_tls.py`)
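A simplified sketch of such a cluster-wide check, assuming per-unit peer-data flags `tls_ca_renewing` and `tls_ca_renewed` (the second flag name is an assumption here):

```python
# Illustrative sketch, not the method in lib/charms/opensearch/v0/opensearch_tls.py.
def ca_rotation_complete_in_cluster(all_units_peer_data: list[dict]) -> bool:
    """Return True only if no CA rotation is ongoing anywhere in the cluster."""
    started = [bool(d.get("tls_ca_renewing")) for d in all_units_peer_data]
    finished = [bool(d.get("tls_ca_renewed")) for d in all_units_peer_data]
    if not any(started):
        # No unit has flagged a rotation: nothing blocks certificate updates.
        return True
    # As soon as any unit flags the rotation, it only counts as complete once every
    # unit has finished, so a fast unit does not swap its certificates while slower
    # units still trust only the old CA.
    return all(finished)
```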
## Issue

In large deployments with cluster-manager-only nodes and data-only nodes, a deadlock might appear if the whole data application has been powered down and needs to restart. In this case, starting the data nodes should be possible without acquiring the lock first, because the lock index is not available.

## Solution

Start the leader of the data nodes without acquiring the lock if it was previously started but is not running anymore. The PR also adds unit test coverage for the case where a cluster-manager-only node has not initialized the security index and a failover orchestrator with a data role joins and wants to start (see [here](#424 (comment)) for context).
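A condensed sketch of that decision, with hypothetical attribute names standing in for the charm's actual state checks:

```python
# Illustrative sketch; the attribute names below are assumptions, not the charm's API.
def may_start_without_lock(unit) -> bool:
    """Whether the data-node leader may restart without the usual restart lock.

    When the whole data application is down, the lock index lives on the stopped
    data nodes, so acquiring the lock is impossible and waiting for it would
    deadlock the restart.
    """
    return (
        unit.is_leader
        and "data" in unit.roles
        and unit.was_started_before      # the node ran before the outage...
        and not unit.service_is_running  # ...but is not running anymore
    )
```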
## Issue

Currently the integration test for re-using storage fails occasionally. This is because in one of the tests we remove the application, keeping the storage disks and re-attaching them to a new application. This can lead to stale metadata, as described in [our docs](https://charmhub.io/opensearch/docs/h-attached-storage), which will cause the OpenSearch service to fail on startup.

## Solution

Adjust the integration test workflow to first scale down to one remaining unit before removing the application. This will cause the remaining unit to become the leader, if it wasn't already. Removing the application now, and later re-attaching this unit's storage disk to the new leader, means that we can start up OpenSearch correctly.
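Roughly, the adjusted flow could look like the following pytest-operator excerpt; the application and unit names are illustrative, and the re-deploy and storage re-attachment steps are elided:

```python
# Hypothetical excerpt of the adjusted integration test, not the repository's code.
async def test_storage_reuse(ops_test):
    app = ops_test.model.applications["opensearch"]

    # Scale down to a single unit first: the surviving unit becomes (or already is)
    # the leader, so its disk is the one later re-attached to the new leader.
    await app.destroy_units("opensearch/1", "opensearch/2")
    await ops_test.model.wait_for_idle(apps=["opensearch"], status="active", timeout=1800)

    # Only now remove the application; the storage disk is kept for re-attachment
    # (re-deploying and attaching the disk to the new application is elided here).
    await app.destroy()
```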
This PR extends the current charm to support the following profiles:

* testing:
  * focused on the integration tests and our CI -> 1G RAM dedicated to the heap and no automation
* staging:
  * HA capabilities must be available: for that, we will enforce an index template that encompasses all the indices and sets replicas to 1-all
  * Extends heap to: `max(1G, 10% of RAM)`
  * `indices.memory.index_buffer_size` extends to `25%`
  * Adds three component templates, which will be described later
* production:
  * Same features as staging, but heap is set instead to: `max(1G, 50% of RAM)`

The options above are set based on the following documents:

* https://opensearch.org/docs/latest/tuning-your-cluster/performance/
* https://opensearch.org/docs/latest/search-plugins/knn/performance-tuning/

The user can switch between the three options above, and depending on the selected value, the templates are created or destroyed.
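A small sketch of the heap sizing rule above, assuming total system RAM is known in megabytes (the helper name is illustrative, not the charm's code):

```python
# Illustrative sketch of the profile-based heap sizing described above.
def heap_size_mb(profile: str, total_ram_mb: int) -> int:
    """Return the JVM heap size in MB for a given performance profile."""
    if profile == "testing":
        return 1024                                 # fixed 1G, no further automation
    if profile == "staging":
        return max(1024, int(total_ram_mb * 0.10))  # max(1G, 10% of RAM)
    if profile == "production":
        return max(1024, int(total_ram_mb * 0.50))  # max(1G, 50% of RAM)
    raise ValueError(f"unknown profile: {profile}")


# For example, on a 32G machine: staging -> ~3.2G of heap, production -> 16G of heap.
```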
There is a potential race condition within `opensearch_relation_provider.py`, where we may ask for a user to be set (and hence, eventually call `update_certs`) before the certificate is ready. This happens more often when we deploy a complete bundle at once. This PR fixes the issue by allowing `update_certs` to detect the missing `chain` config in the TLS secrets and raise a `KeyError`. Closes: #493
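A minimal sketch of that guard, assuming the TLS secrets are exposed as a dict-like object and the chain is a list of PEM strings (the path and helper name are illustrative):

```python
from pathlib import Path

REQUEST_BUNDLE = Path("/tmp/chain.pem")  # placeholder path for the requests CA bundle


def update_certs(tls_secrets: dict) -> None:
    """Hypothetical sketch: refresh the requests CA bundle from the TLS secrets."""
    if "chain" not in tls_secrets:
        # Certificate not issued yet (common when a whole bundle is deployed at once):
        # surface it as KeyError so the caller can back off and retry later.
        raise KeyError("chain")
    REQUEST_BUNDLE.write_text("\n".join(tls_secrets["chain"]))
```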
## Issue

If the OpenSearch operator receives a config change before the deployment description has been initialized, applying the performance profile will fail and the charm will be in a failed state. This happens repeatedly on the integration test runs for OpenSearch in the data-integrator CI, e.g. [here](https://github.com/canonical/data-integrator/actions/runs/11585189263/job/32253824740).

## Solution

If the deployment description does not exist yet, the performance profile should not be applied and the received event should be deferred.
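A minimal sketch of that guard in a config-changed handler, using the ops framework's standard `event.defer()`; the deployment-description accessor and profile helper are named illustratively:

```python
# Hypothetical handler excerpt; only event.defer() is the framework's real API here.
def on_config_changed(charm, event) -> None:
    if charm.opensearch_peer_cm.deployment_desc() is None:
        # Applying the performance profile before the deployment description exists
        # would leave the charm in a failed state, so retry once it is initialized.
        event.defer()
        return
    charm.apply_performance_profile()  # illustrative name for the profile logic
```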
Our CIs are failing with: https://github.com/canonical/opensearch-operator/actions/runs/11675736324/job/32510812135 This PR updates `actions/checkout` to v4, in line with other charms.
Sync charm docs from https://discourse.charmhub.io Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
## Issue

Some alert rules have duplicated names, which can be a bit confusing.

## Solution

This PR gives them distinct names so that each name matches its alert rule expression more clearly, and adds spacing between alerts to improve readability.
This PR removes one of the filters that was causing a race condition during CA rotation in large deployments. It fixes the CA rotation for large deployments in our CI. Closes #500
This PR extends our current charm by adding a Terraform module for the charm. It also adds a GitHub workflow to test it, plus a basic test setup.
If a scrape fails, this might indicate that a unit is not in a healthy state. OpenSearch currently does not have a metric saying that a node is down. For example, if the systemd service is stopped on one node, the cluster (N nodes) will drop the faulty node because of connectivity issues, and the metrics will show that the cluster now has N-1 nodes without saying that one node has failed. With this new alert, at least a notification will appear if a node stops being responsive.

How to test:
- Deploy opensearch units
- Stop the opensearch daemon on one of the units

The grafana-agent injects the juju topology into the alert rule, so the expression `up < 1` will filter just for OpenSearch apps:

![image](https://github.com/user-attachments/assets/d09b22be-a571-4ec2-b76d-a654186df327)

The alert will trigger:

![image](https://github.com/user-attachments/assets/a1ace958-1188-4b2b-840c-febfd639dd57)
OpenSearch does not have a thread pool named `bulk`. However, there is a pool named `write` that is used both for bulk requests and for single-document writes. This PR adds the necessary logic to use the OpenSearch data coming from the exporter and, because the expression is big, splits it into smaller expressions using recording rules (the `record` field). To see the original rule that inspired this alert, see this [repo](https://github.com/lukas-vlcek/prometheus-elasticsearch-rules/blob/master/logging_elasticsearch.rules.yaml). Fix: #503
When a cluster is in yellow state, it is because some shards are not active. This can happen under heavy load; however, if there are unassigned shards and the cluster is yellow, that means some replicas are not allocated and it might be necessary to add new nodes in order to host all shards. --------- Co-authored-by: Mehdi Bendriss <[email protected]>
phvalguima merged commit 561fa89 into DPE-4196-improve-plugin-manager on Dec 13, 2024, with 39 of 41 checks passed.
phvalguima deleted the rebase-2024.12.07-DPE-4196-improve-plugin-manager branch on December 13, 2024 at 15:49.
There were several changes to the config-changed logic, which made this rebase rather complex. I am making a PR to the main refactor branch so we can look at it more carefully before having it all together.