
[DPE-4196] Rebase the plugin refactor #520

Conversation

phvalguima
Contributor

There were several changes to the config-changed logic, which made this rebase rather complex. I am opening a PR against the main refactor branch so we can review it more carefully before bringing it all together.

skourta and others added 30 commits October 13, 2024 14:25
This PR adds some extra logging so we can follow up on any failures during index creation.
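A minimal sketch of the kind of logging this adds; the helper name, client call, and index name are illustrative and not the charm's actual API:

```python
import logging

logger = logging.getLogger(__name__)


def create_index_if_needed(client, index_name: str) -> bool:
    """Create an index and log enough context to debug failures later."""
    try:
        response = client.request("PUT", f"/{index_name}")  # hypothetical client call
        logger.debug("Index %s creation response: %s", index_name, response)
        return True
    except Exception as exc:
        # Log the failure with the index name so CI runs can be followed up.
        logger.error("Failed to create index %s: %s", index_name, exc)
        return False
```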
Sync charm docs from https://discourse.charmhub.io

Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
…arch charm (#481)

Moves the grafana-agent charm to use the same series as the opensearch charm.

Closes #477
Updating libs to latest versions.
Update the `replace()` method to (1) be more "greedy" when searching for
matches in the text, including multi-line matches; and (2) fix the write-back
to the file.

Currently, if the new content is smaller than the original file, we can end up
writing `<new content><leftover bytes>`, so the file keeps its original size.

This PR simplifies the logic to decide how to write the changed content.
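A minimal sketch of the simplified write-back, using only the standard library (this is not the charm's actual implementation); the key point is rewriting the whole file so it gets truncated and no leftover bytes remain:

```python
import re
from pathlib import Path


def replace(path: str, pattern: str, replacement: str) -> None:
    """Replace all matches of pattern (including multi-line ones) and rewrite the file."""
    original = Path(path).read_text()
    # re.DOTALL lets '.' span newlines, so multi-line matches are found too.
    updated = re.sub(pattern, replacement, original, flags=re.DOTALL)
    # Writing the full content back truncates the file, so nothing is left over
    # when the new content is shorter than the original.
    Path(path).write_text(updated)
```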

Closes #483
Moves the CI run for `test_storage.py` to use `wait_until`.

Closes #476
Add `cos-agent` to the wait list.

Closes #475

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
## Issue
We were not updating the `chain.pem` file used by the request module to
send requests, which caused communication cuts in many cases.
Also fixes #448

## Solution
Add the new chain to `chain.pem` when CA rotation is initiated, and only
delete the old one once all the nodes are done updating their CA **and** certs.
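A rough sketch of the intended behaviour, with hypothetical helper names and an illustrative path (the real charm code differs): append the new CA to the request bundle when rotation starts, and only drop the old one once every unit reports that both its CA and its certificates have been rotated.

```python
from pathlib import Path

CHAIN_PEM = Path("/etc/opensearch/certificates/chain.pem")  # illustrative path


def add_ca_to_request_bundle(new_ca_pem: str) -> None:
    """Append the new CA so requests keep validating both old and new certs."""
    existing = CHAIN_PEM.read_text() if CHAIN_PEM.exists() else ""
    if new_ca_pem not in existing:
        CHAIN_PEM.write_text(existing + "\n" + new_ca_pem)


def remove_old_ca(old_ca_pem: str, rotation_complete_in_cluster: bool) -> None:
    """Drop the old CA only after every node finished rotating its CA *and* certs."""
    if not rotation_complete_in_cluster:
        return
    CHAIN_PEM.write_text(CHAIN_PEM.read_text().replace(old_ca_pem, "").strip() + "\n")
```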


## Summary of changes
This pull request introduces several changes to improve the handling of
CA (Certificate Authority) rotation in the OpenSearch charm. The key
changes include enhancements to the CA rotation process, ensuring proper
cleanup of old CA certificates, and updating the request CA bundle.
Additionally, a new method has been added to handle the completion of CA
rotation.

### Improvements to CA Rotation Process:
*
[`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652):
Added checks to ensure CA rotation is complete in the cluster before
removing old CAs and updating the request bundle (`on_tls_conf_set`,
`post_start_init`, `_on_update_status`).
[[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R645-R652)
[[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L821-R839)
[[3]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43R1180-R1188)
*
[`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911):
Introduced new methods `ca_and_certs_rotation_complete_in_cluster` and
`on_ca_certs_rotation_complete` to handle the completion of CA rotations
and update the request bundle accordingly.
[[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R880-R911)
[[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L887-R957)

### Version Updates:
*
[`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L121-R121):
Incremented `LIBPATCH` version from 2 to 3.
*
[`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L64-R65):
Incremented `LIBPATCH` version from 1 to 2.

### Unit Test Enhancements:
*
[`tests/unit/lib/test_opensearch_base_charm.py`](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373):
Updated unit tests to mock new methods and ensure they are called
correctly during the CA rotation process.
[[1]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34L360-R373)
[[2]](diffhunk://#diff-39b4fe0e40f461d4a9731e4cb8281d7a1f6eaccaee15e3b3a4ec796840136d34R383-R388)

### Code Cleanup:
*
[`lib/charms/opensearch/v0/opensearch_base_charm.py`](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957):
Removed redundant calls to `update_request_ca_bundle` and
`remove_old_ca` in `_start_opensearch` and `_post_start_init`.
[[1]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L952-L957)
[[2]](diffhunk://#diff-5c0d89f838ffd5a9ecf2288b3698320fda2d2b2b737361b6e10adb55d1929f43L1164-L1167)

### Dependency Updates:
*
[`lib/charms/opensearch/v0/opensearch_tls.py`](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4R24):
Added import for `Path` from `pathlib`.
This PR extends `wait_until` to search for the status of the units of a
given subordinate charm, based on its principal charm.
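A simplified sketch of the idea; the status layout mirrors `juju status --format=json` and the helper names are assumptions rather than the repository's actual test utilities. Subordinate units do not appear as top-level applications, so their statuses are looked up through the principal charm's units.

```python
def subordinate_statuses(status: dict, subordinate_app: str, principal_app: str) -> dict:
    """Collect workload statuses of a subordinate charm's units via its principal units."""
    statuses = {}
    for unit in status["applications"][principal_app]["units"].values():
        for sub_name, sub in unit.get("subordinates", {}).items():
            if sub_name.startswith(f"{subordinate_app}/"):
                statuses[sub_name] = sub["workload-status"]["current"]
    return statuses


def all_subordinates_active(status: dict, subordinate_app: str, principal_app: str) -> bool:
    """Condition a wait loop could poll until every subordinate unit is active."""
    units = subordinate_statuses(status, subordinate_app, principal_app)
    return bool(units) and all(s == "active" for s in units.values())
```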
…cluster (#486)

## Issue
Fixes #482.
If a unit finishes updating its CA before any of the other units start
the rotation, then this unit will update its certificates, which will
cause communication issues with the rest of the nodes.

## Solution
Check whether `tls_ca_renewing` is set on any of the units; if so,
consider the rotation started.
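A simplified sketch of the check, written over a plain mapping of per-unit flags (the real method reads these from the peer relation databag; the `tls_ca_renewing` flag name comes from the PR text, while `tls_ca_renewed` is an assumed name for the "finished" flag):

```python
def ca_rotation_complete_in_cluster(unit_flags: dict[str, dict[str, bool]]) -> bool:
    """Return True only if no started rotation is still in progress anywhere.

    `unit_flags` maps unit name -> flags read from the peer relation, e.g.
    {"opensearch/0": {"tls_ca_renewing": True, "tls_ca_renewed": True}, ...}
    """
    rotation_started = any(f.get("tls_ca_renewing") for f in unit_flags.values())
    if not rotation_started:
        # No unit has started a rotation: nothing to wait for.
        return True
    # Once any unit is renewing, every unit must have both started and finished
    # before old CAs are removed and the request bundle is updated.
    return all(
        f.get("tls_ca_renewing") and f.get("tls_ca_renewed")
        for f in unit_flags.values()
    )
```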

This pull request focuses on enhancing the logic for checking the status
of CA (Certificate Authority) rotation in the `opensearch_tls.py`
module. The changes aim to improve the accuracy and clarity of
determining whether the CA rotation process is complete across all units
in the cluster.

### Enhancements to CA Rotation Logic:

* **Updated Condition Checks:**
- Replaced `is_ca_rotation_ongoing` with
`ca_rotation_complete_in_cluster` to ensure the correct status is
checked before deferring events and updating TLS resources.
(`lib/charms/opensearch/v0/opensearch_tls.py`)
[[1]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L213-R213)
[[2]](diffhunk://#diff-fcbdb57c37f5d32b5ecf046327cfa9ee91a33f156714ea7fb6f000a48d2c4ca4L640-R640)

* **New Method Implementation:**
- Implemented `ca_rotation_complete_in_cluster` to accurately determine
if CA rotation is complete by checking flags across all units. This
method now tracks both ongoing and completed states more effectively.
(`lib/charms/opensearch/v0/opensearch_tls.py`)

* **Removed Redundant Method:**
- Removed the `is_ca_rotation_ongoing` method as its functionality is
now covered by the enhanced `ca_rotation_complete_in_cluster` method.
(`lib/charms/opensearch/v0/opensearch_tls.py`)
## Issue
In large deployments with cluster-manager-only nodes and data-only
nodes, a deadlock might appear if the whole data application has been
powered down and needs to restart. In this case, starting the data node
should be possible without acquiring the lock first, because the lock
index is not available.

## Solution
Start the leader of the data-nodes without acquiring the lock if it was
previously started but is not running anymore.
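A rough sketch of that decision, with hypothetical helper arguments standing in for the charm's actual state checks (this is not the charm's code):

```python
def can_start_without_lock(
    is_data_leader: bool,
    was_started_before: bool,
    service_is_running: bool,
    lock_index_available: bool,
) -> bool:
    """Allow the data-node leader to restart when the lock index cannot be reached."""
    if lock_index_available:
        # Normal path: the OpenSearch-backed lock must be acquired first.
        return False
    # Deadlock breaker: the whole data application was powered down, so the
    # leader that had previously started may come back up without the lock.
    return is_data_leader and was_started_before and not service_is_running
```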

The PR also adds unit test coverage for the case where a
cluster-manager-only node has not initialized the security index and a
failover orchestrator with the data role joins and wants to start (see
[here](#424 (comment))
for context).
## Issue
Currently the integration test for re-using storage fails occasionally.
This is because in one of the tests we remove the application, keeping
the storage disks and re-attaching them to a new application. This can
lead to stale metadata, as described in [our
docs](https://charmhub.io/opensearch/docs/h-attached-storage), which
will cause the Opensearch service to fail on startup.

## Solution
Adjust the integration test workflow to first scale down to one
remaining unit before removing the application. This will cause the
remaining unit to become the leader, if it wasn't already. Removing the
application now and later re-attaching this unit's storage disk to the
new leader means that we can start up Opensearch correctly.
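A sketch of the adjusted test flow using pytest-operator/python-libjuju; the method names (`destroy_units`, `remove`, `wait_for_idle`) are assumed from libjuju and may differ from the actual helpers used in the repository's tests.

```python
import pytest

APP_NAME = "opensearch"  # illustrative


@pytest.mark.abort_on_fail
async def test_storage_reuse(ops_test):
    app = ops_test.model.applications[APP_NAME]

    # Scale down to a single unit first (keeping the leader in the real test),
    # so the survivor becomes (or stays) the leader and its storage disk carries
    # consistent cluster metadata.
    units_to_remove = [unit.name for unit in app.units[1:]]
    await app.destroy_units(*units_to_remove)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active")

    # Only now remove the application, keeping the storage so it can later be
    # re-attached to the leader of the freshly deployed application.
    await app.remove()
```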
This PR extends the current charm to support the following profiles:
* testing:
  * Focused on the integration tests and our CI -> 1G RAM dedicated to the heap and no automation
* staging:
  * HA capabilities must be available: for that, we will enforce an index template that encompasses all the indices and sets `replica: 1-all`
  * Extends heap to: `max(1G, 10% of RAM)`
  * `indices.memory.index_buffer_size` extends to `25%`
  * Adds three component templates, which will be described later
* production:
  * Same features as staging, but heap is set instead to: `max(1G, 50% of RAM)`

The options above are set based on the following documents:
https://opensearch.org/docs/latest/tuning-your-cluster/performance/

https://opensearch.org/docs/latest/search-plugins/knn/performance-tuning/

The user can switch between the three options above, and depending on
the selected value, the templates are created or destroyed.
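A small sketch of the heap-size rule described above (the helper name and units are illustrative, not the charm's code):

```python
def heap_size_gb(total_ram_gb: float, profile: str) -> float:
    """Compute the JVM heap per profile: 1G for testing, max(1G, 10%/50% of RAM) otherwise."""
    if profile == "testing":
        return 1.0
    if profile == "staging":
        return max(1.0, 0.10 * total_ram_gb)
    if profile == "production":
        return max(1.0, 0.50 * total_ram_gb)
    raise ValueError(f"unknown profile: {profile}")


# Example: a 64G machine gets 6.4G of heap on staging and 32G on production.
```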
There is a potential race condition within
`opensearch_relation_provider.py`, where we may ask for a user to be set
(and hence, eventually call `update_certs`) before the certificate is
ready. This happens more often when we deploy a complete bundle at once.

This PR fixes the issue by allowing `update_certs` to detect the missing
`chain` config in the TLS secrets and throw a `KeyError`.

Closes: #493
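A minimal sketch of the idea with hypothetical names and an illustrative path (the real `update_certs` signature differs): look up the `chain` key directly so a missing chain surfaces as a `KeyError`, which the caller treats as "certificate not ready yet".

```python
from pathlib import Path

CHAIN_PEM = Path("/tmp/chain.pem")  # illustrative path


def update_certs(tls_secrets: dict) -> None:
    """Write the request CA bundle; raise KeyError if the chain is not available yet."""
    chain = tls_secrets["chain"]  # KeyError here means TLS is not ready yet
    CHAIN_PEM.write_text("\n".join(chain) if isinstance(chain, list) else chain)


def maybe_set_up_user(tls_secrets: dict) -> bool:
    """Return False (and do nothing) when the certificate chain is not ready yet."""
    try:
        update_certs(tls_secrets)
    except KeyError:
        # Race with TLS setup (e.g. a whole bundle deployed at once): retry later.
        return False
    return True
```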

## Issue
If the Opensearch operator receives a config change before the
deployment description has been initialized, applying the performance
profile will fail and the charm will be in a failed state.

This happens repeatedly on the integration test runs for Opensearch in
the data-integrator CI, e.g.
[here](https://github.com/canonical/data-integrator/actions/runs/11585189263/job/32253824740).

## Solution
If the deployment description does not exist yet, the performance
profile should not be applied and the received event should be deferred.
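A short sketch of the guard in an ops event handler; the `_deployment_desc` and `_apply_performance_profile` names are stand-ins for the charm's internals, which the PR text only describes.

```python
import ops


class OpenSearchCharmSketch(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:
        deployment_desc = self._deployment_desc()
        if deployment_desc is None:
            # Deployment description not initialized yet: do not apply the
            # performance profile now; retry on a later dispatch instead.
            event.defer()
            return
        self._apply_performance_profile(deployment_desc)

    def _deployment_desc(self):
        return None  # placeholder for illustration

    def _apply_performance_profile(self, deployment_desc) -> None:
        ...
```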
Our CIs are failing with:
https://github.com/canonical/opensearch-operator/actions/runs/11675736324/job/32510812135

This PR updates `actions/checkout` to v4, in line with other charms.
Sync charm docs from https://discourse.charmhub.io

Co-authored-by: GitHub Actions <41898282+github-actions[bot]@users.noreply.github.com>
## Issue
Some rules have duplicated names that can be a little confusing.

## Solution
This PR separates them into different names to align more clearly with each alert
rule expression, and adds spaces between alerts to improve readability.
This PR removes one of the filters that was causing a race condition during
CA rotation in large deployments. It fixes the CA rotation for large
deployments in our CI.

Closes #500
This PR extends our current charm to add a Terraform module for the
charm. It also adds a GitHub workflow to test it, plus a basic test setup.
If a scrape fails, this might indicate that a unit is not in a healthy
state.

OpenSearch currently does not have a metric saying that one node is
down. E.g. if the systemd service is stopped on one node, the cluster (N
nodes) will drop the faulty node because of connectivity issues, and the
metrics will show that the cluster now has N-1 nodes without saying that
one node has failed.

With this new alert, at least a notification will appear if one node
stops being responsive.


How to test:

- Deploy opensearch units
- Stop the opensearch daemon in one of the units

The grafana-agent injects the Juju topology into the alert rule, so the
expression `up < 1` will filter just for OpenSearch apps:


![image](https://github.com/user-attachments/assets/d09b22be-a571-4ec2-b76d-a654186df327)

The alert will trigger:

![image](https://github.com/user-attachments/assets/a1ace958-1188-4b2b-840c-febfd639dd57)
OpenSearch does not have a thread pool named bulk. However, there
is a field named "write" that is used for bulk operations as well as for
writing a single document.

This PR adds the necessary logic to use the OpenSearch data coming from
the exporter, and because the expression is big, it was split into
smaller expressions using recording (`record`) rules.

To see the original rule that inspired this alert, see this
[repo](https://github.com/lukas-vlcek/prometheus-elasticsearch-rules/blob/master/logging_elasticsearch.rules.yaml)

Fix: #503
When a cluster is in yellow state, it is because some shards are not
active. This can happen under heavy load; however, if there are
unassigned shards and the cluster is yellow, that means some replicas
are not allocated and it might be necessary to add new nodes in order to
host all shards.

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
@phvalguima phvalguima marked this pull request as ready for review December 13, 2024 15:49
@phvalguima phvalguima merged commit 561fa89 into DPE-4196-improve-plugin-manager Dec 13, 2024
39 of 41 checks passed
@phvalguima phvalguima deleted the rebase-2024.12.07-DPE-4196-improve-plugin-manager branch December 13, 2024 15:49