[DPE-4196] Plugin Management Refactor #435

phvalguima · 2024-09-10T16:56:39Z

The main goals are to: (1) separate the API calls (/_cluster) into plugin-specific calls and cluster settings; (2) curb the need for restarts on plugin changes; (3) get faster response times by using cached entries wherever possible; (4) add a new use case with secret handling within the plugin logic itself; and (5) define models so we can standardize the plugin data exchanged between objects and via relations.

Dev experience

The idea is to make it easier to add management for separate plugins without impacting the rest of the code. A dev willing to add a new plugin must decide: does the plugin need to manage a relation, or only config options on the charm?
If only config options are needed, then we can add a config-only plugin.
If a relation is needed, then we will need a new object to encapsulate the plugin handling: the "DataProvider".

The config-only plugin

These are plugins configured via config options. In this case, we only need to add a new OpenSearchPlugin child class that manages the config options to be added to or removed from the cluster.

For example, opensearch-knn receives the config from the charm and returns the options to be set in the opensearch.yml.
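
As a rough illustration of the config-only case, the sketch below shows a child class that turns a charm config value into opensearch.yml entries. The class names `ConfigOnlyPlugin`, `PluginConfig` and `KnnPlugin`, the config option `plugin_opensearch_knn` and the setting key are assumptions made for this example; they are not taken from the PR's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PluginConfig:
    """Hypothetical container for entries destined for opensearch.yml.

    A value of None marks a key that should be removed.
    """

    config_entries: Dict[str, Optional[str]] = field(default_factory=dict)


class ConfigOnlyPlugin:
    """Hypothetical stand-in for an OpenSearchPlugin child class."""

    def __init__(self, charm_config: Dict[str, object]):
        self._charm_config = charm_config

    @property
    def name(self) -> str:
        raise NotImplementedError

    def config(self) -> PluginConfig:
        raise NotImplementedError


class KnnPlugin(ConfigOnlyPlugin):
    """Reads one charm option and returns the matching yml entries."""

    @property
    def name(self) -> str:
        return "opensearch-knn"

    def config(self) -> PluginConfig:
        # Assumed option name and setting key, for illustration only.
        enabled = bool(self._charm_config.get("plugin_opensearch_knn", False))
        return PluginConfig({"knn.plugin.enabled": "true" if enabled else "false"})
```

The plugin itself holds no relation or API logic; it only maps config in to settings out.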

The relation-based plugins

These plugins are more elaborate, as they have to process events specific to the given plugin. We also must consider the case of large deployments, where data may come via dedicated relation or the peer-cluster relation.

These plugins should be managed by a separate entity, named the relation manager. Defining a common structure for the relation manager is outside of the scope of this PR.

For example, repository-s3 and OpenSearch backup.

New Plugin Manager Infra

Now, the plugin manager is able to manage plugins that depend on config options, API calls and secrets. Whenever adding a new plugin, we should consider:

opensearch_plugins.py: the plugin should have a representation that is consumable by plugin_manager; it should be composed of all the configurations and keys to be added to or removed from the cluster's main configuration
opensearch_plugin_manager.py: add the new plugin to the plugin dict; the manager must be able to instantiate this new plugin
opensearch_{plugin-name}.py: if the plugin is managed by a given relation, this lib will implement the relation manager and interface with OpenSearch's plugin-specific APIs
models.py: add any relation data model to this lib

Using the new plugin data provider

While specific classes take care of the plugin's API calls (e.g. the /_snapshot API for the backup plugin is handled by the OpenSearchBackup class), the data provider facilitates the exchange of relation data between the specific class and the plugin_manager itself. This way, the plugin manager can apply any cluster-wide configurations that are needed for that plugin.

We need a class to deal with relation specifics, as some plugins may expect different relations depending on their deployment description, e.g. OpenSearchBackupPlugin. The OpenSearchPluginDataProvider encapsulates that logic away from the main plugin classes.
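
A minimal sketch of that split, assuming a provider that chooses the databag based on the deployment role and hands the plugin plain data. `BackupSettings`, `BackupDataProvider` and the relation names used here are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BackupSettings:
    """Plain data handed to the plugin; it knows nothing about relations."""

    bucket: Optional[str] = None
    endpoint: Optional[str] = None


class BackupDataProvider:
    """Hypothetical provider: picks the databag that applies to this deployment."""

    def __init__(self, is_main_orchestrator: bool, relations: Dict[str, Dict[str, str]]):
        self._is_main = is_main_orchestrator
        self._relations = relations

    @property
    def relation_name(self) -> str:
        # Assumed names: the main orchestrator reads the S3 relation directly,
        # other clusters receive the data over the peer-cluster relation.
        return "s3-credentials" if self._is_main else "peer-cluster"

    def get_data(self) -> BackupSettings:
        databag = self._relations.get(self.relation_name, {})
        return BackupSettings(
            bucket=databag.get("bucket"),
            endpoint=databag.get("endpoint"),
        )
```

The plugin only ever sees a `BackupSettings` value, so it does not need to know which relation the data came from.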

Secret Management

Each plugin that handles specific secrets must implement the secret management logic itself. The goal is to avoid filling the opensearch_secrets.py methods with ifs for each plugin case, and to keep each plugin's code isolated.

Remove unneeded restarts and add caching

We ensure that any configuration changes that come from plugin management are applied via API before being persisted on config files. If the API responds with a 200 status, then we only need to write the new value to the configuration file and finish without a restart.

In case the service is down and the API is not available, we can assume the service will eventually be started again. In this case, it suffices to write the config entries to the files and leave it to the next start to pick them up.
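
The flow described above can be sketched as follows. This is not the charm's implementation: `apply_plugin_settings` and its callable parameters are hypothetical, and the behaviour on a non-200 response (persist the values and report that a restart is still required) is an assumption, since the description only covers the 200 and service-down cases.

```python
from typing import Callable, Dict, Optional

Settings = Dict[str, Optional[str]]


def apply_plugin_settings(
    settings: Settings,
    service_running: bool,
    put_cluster_settings: Callable[[Settings], int],
    write_config_file: Callable[[Settings], None],
) -> bool:
    """Apply settings and return True only if a restart is still required."""
    if not service_running:
        # Service is down: persist only; the next start picks the values up.
        write_config_file(settings)
        return False

    # API first, then persist to the config file.
    status = put_cluster_settings(settings)
    write_config_file(settings)
    # A 200 means the running cluster already holds the values: no restart.
    # Assumption: any other status leaves a restart as the fallback.
    return status != 200
```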

This task is going to be divided into 3 parts:

Address the low-hanging fruit: reduce the number of restarts and add caching support
Three main actions: (i) merge {add,delete}_plugin together, along with their equivalents in the OpenSearchPluginConfig class; (ii) receive one big dictionary where a key: None entry means we simply want to delete that entry; and (iii) make the main OpenSearchKeystore observe secret changes and update its values accordingly
Return the unit tests: these are going to be commented out whilst Parts 1 and 2 happen, given this part of the code was covered with extensive testing

The current implementation of plugin_manager.run waits for the cluster to be started before processing its config changes. We relax this requirement and allow for the case where the cluster is not yet ready, so we can modify the configuration without issuing a restart request.
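
The "key: None means delete" convention from Part 2 can be illustrated with a small helper. `merge_settings` is a hypothetical name and the setting keys are placeholders; the sketch only shows how a single dictionary can carry both additions and removals.

```python
from typing import Dict, Optional


def merge_settings(
    current: Dict[str, str], changes: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Return a new dict with changes applied; None values delete the key."""
    merged = dict(current)
    for key, value in changes.items():
        if value is None:
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            merged[key] = value
    return merged
```

With this shape, separate add and delete code paths are not needed: one call handles both.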

#252 is closed with the OpenSearchPluginRelationsHandler interface, which allows plugins to define how they handle their relation(s). The opensearch_backup module extends this interface and defines a checker to process the details of either small or large deployments.

Other relevant changes:

Rename the method check_plugin_manager_ready to check_plugin_manager_ready_for_api
Any plugin that needs to manage things via API calls should check the health of the cluster using check_plugin_manager_ready_for_api
Move opensearch_distro.version to load the workload_version file we already ship instead of making an API call. This is twofold: (1) it removes the dependency on the cluster being ready, and (2) it keeps this method in sync with recent changes to the upgrade logic
Waive the need to load the default settings if this particular unit is powered down. At that moment we can make any config changes, as the unit will eventually be powered back up
If /_cluster/settings is available: apply the configs via API and do not add a restart request
In the config-changed handler, the upgrade_in_progress check takes precedence and will keep deferring the config-changed event until the upgrade is finished, before calling the plugin manager
Create an OpenSearchKeystoreNotReadyYetError: it signals that the keystore has not been initialized yet across the cluster and hence we cannot manage any plugins that use it; however, we always apply the opensearch.yml changes from that plugin
Add cached_property wherever it makes sense, along with logic to clear the cache if there were any relevant changes to its content
That still frees config_changed to call plugin_manager.run() before everything is set, as the run() method changes hard configuration only.
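
For the caching item above, a minimal sketch of `functools.cached_property` combined with an explicit cache reset. `PluginManagerSketch` and its methods are hypothetical names; the only library behaviour relied on is that `cached_property` stores its result in the instance `__dict__` under the property name.

```python
from functools import cached_property
from typing import Callable, List


class PluginManagerSketch:
    """Hypothetical manager caching an expensive plugin listing."""

    def __init__(self, list_plugins: Callable[[], List[str]]):
        self._list_plugins = list_plugins

    @cached_property
    def installed_plugins(self) -> List[str]:
        # Runs once; the result is stored in the instance __dict__.
        return self._list_plugins()

    def clean_cache_if_needed(self, changed: bool) -> None:
        if changed:
            # Dropping the stored value forces a recompute on next access.
            self.__dict__.pop("installed_plugins", None)
```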

Closes #252, #280, #244

phvalguima · 2024-09-13T12:23:53Z

lib/charms/opensearch/v0/opensearch_base_charm.py

                HealthColors.GREEN,
                HealthColors.IGNORE,
            ]:
-                event.defer()


This defer brings no real benefit, as another update-status will happen anyway.

What if the next event is not an update-status event?

Well, there are a few answers here:

Update-status can be set to arbitrarily long intervals or even disabled; hence, we should never do anything here other than keep the status up to date

Over a one-year span, update-status will outnumber any other type of event; if we start deferring them, we may have multiple update-status events running at once

Update-status fires on a time threshold; if our hooks take too long, update-status events accumulate. If update-status itself takes too long (e.g. because of previously deferred update-status events), we end up in an endless loop

What benefit does this deferral bring?

lib/charms/opensearch/v0/opensearch_backups.py

phvalguima · 2024-09-13T14:39:10Z

tests/helpers.py

 from cryptography import x509
 from cryptography.hazmat.primitives import hashes, serialization
 from cryptography.hazmat.primitives.asymmetric import rsa
 from cryptography.x509.oid import NameOID


+def patch_wait_fixed() -> Callable:


Speeds up tenacity retries in tests by replacing the wait with a shorter one.

phvalguima · 2024-09-14T08:14:28Z

lib/charms/opensearch/v0/opensearch_backups.py

-        if self.charm.unit.is_leader():
-            self.charm.status.clear(BackupSetupFailed, app=True)
+            self.charm.status.set(BlockedStatus(BackupSetupFailed))
+            self.charm.status.set(BlockedStatus(BackupSetupFailed), app=True)


Following the conversation at: https://chat.canonical.com/canonical/pl/oqwsfdejufgepettcqzobnw7xh

Automatically updating the app status once all units are in a given status is neither guaranteed nor wanted by the entire team.

phvalguima · 2024-09-14T09:13:04Z

lib/charms/opensearch/v0/opensearch_keystore.py

@@ -34,46 +34,25 @@ class OpenSearchKeystoreError(OpenSearchError):
    """Exception thrown when an opensearch keystore is invalid."""


+class OpenSearchKeystoreNotReadyYetError(OpenSearchKeystoreError):
+    """Exception thrown when the keystore is not ready yet."""


This error is thrown when we try to reload the keys via API and it fails.
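
One way a caller might use such an exception, sketched under assumptions: the opensearch.yml changes are written regardless, and a keystore that is not ready is reported back instead of failing the whole run. `KeystoreError`, `KeystoreNotReadyYetError` and `apply_plugin` are stand-in names for this example, not the charm's API.

```python
from typing import Callable, Dict


class KeystoreError(Exception):
    """Stand-in for OpenSearchKeystoreError."""


class KeystoreNotReadyYetError(KeystoreError):
    """Stand-in for OpenSearchKeystoreNotReadyYetError."""


def apply_plugin(
    update_keystore: Callable[[], None],
    write_yml: Callable[[Dict[str, str]], None],
    yml_changes: Dict[str, str],
) -> bool:
    """Return True if the keystore was updated, False if it was not ready."""
    # The opensearch.yml changes are applied regardless of keystore state.
    write_yml(yml_changes)
    try:
        update_keystore()
    except KeystoreNotReadyYetError:
        # Keystore not usable yet: the caller can retry on a later event.
        return False
    return True
```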

zmraul

Huge amount of work on this PR. Nice!

I've been running this branch locally and didn't find issues so far. Left some comments.

zmraul · 2024-12-02T10:23:43Z

lib/charms/opensearch/v0/opensearch_plugins.py

+class OpenSearchPluginDataProvider:
+    """Implements the data provider for any charm-related data access.
+
+    Plugins may have one or more relations tied to them. This abstract class
+    enables different modules to implement a class that can specify which
+    relations should plugin manager listen to.
+    """


question: Does this class include large deployment relation?

Not directly, but that is the idea. This class exists to abstract the access to the databags away from the basic plugins. The plugin should be "dumb", i.e. just very basic dataclasses. This class provides the plugins with any relation info.

zmraul · 2024-12-02T10:31:09Z

lib/charms/opensearch/v0/opensearch_plugins.py

@@ -452,66 +503,141 @@ def name(self) -> str:
        return "opensearch-knn"


+class OpenSearchPluginBackupDataProvider(OpenSearchPluginDataProvider):


nit: I would personally remove most of the OpenSearch prefixes on these classes, since they are somewhat redundant and create noise when parsing or searching the file.

lib/charms/opensearch/v0/opensearch_keystore.py

zmraul · 2024-12-02T22:49:33Z

lib/charms/opensearch/v0/opensearch_backups.py

+        for event in [
+            charm.on[PeerClusterRelationName].relation_joined,
+            charm.on[PeerClusterRelationName].relation_changed,
+            charm.on[PeerClusterRelationName].relation_departed,


question: Is relation_departed needed?

Yes. Reason: if you have s3 keys in the opensearch-keystore but you do not have backup configured, then the application breaks. Therefore, we need to know when the relation is gone.

zmraul · 2024-12-03T08:18:02Z

lib/charms/opensearch/v0/opensearch_backups.py

-        except OpenSearchHttpError as e:
-            return e.response_body if e.response_body else None
-        return result if isinstance(result, dict) else None
-
    def _is_restore_in_progress(self) -> bool:


nit: should this one be public as well? same as is_backup_in_progress

zmraul · 2024-12-03T08:25:04Z

lib/charms/opensearch/v0/opensearch_backups.py

    elif charm.opensearch_peer_cm.deployment_desc().typ == DeploymentType.MAIN_ORCHESTRATOR:
+        # Using the deployment_desc() method instead of is_provider()
+        # In both cases: (1) small deployments or (2) large deployments where this cluster is the
+        # main orchestrator, we want to instantiate the OpenSearchBackup class.
        return OpenSearchBackup(charm)


question: shouldn't this one also return OpenSearchBackup when type is DeploymentType.FAILOVER_ORCHESTRATOR?

zmraul · 2024-12-03T09:12:16Z

lib/charms/opensearch/v0/opensearch_plugins.py

            raise OpenSearchPluginMissingConfigError(
-                "Plugin {} missing: {}".format(
+                "Plugin {} missing credentials".format(


This message is not informative enough. When integrating with s3, you can have both access_key and secret_key set on the s3 integrator and still get this message. After setting those and adding, e.g., bucket, you get the more informative message from below, which shows the missing fields.

zmraul · 2024-12-03T09:17:03Z

lib/charms/opensearch/v0/opensearch_plugin_manager.py

-            and self._charm.health.get()
-            in [HealthColors.GREEN, HealthColors.YELLOW, HealthColors.IGNORE]
-        )
+    def is_ready_for_api(self) -> bool:


More of a design opinion here, but these checks should probably be centralized at some point. They are invoked from outside this file, which means that it is a relevant check for other components.

@zmraul can you give more precise references? AFAIU, other places are asking whether OpenSearch is healthy; I am asking whether it is responsive.

zmraul · 2024-12-03T09:25:30Z

lib/charms/opensearch/v0/opensearch_plugin_manager.py

        except OpenSearchCmdError as e:
            if "not found" in str(e):
                logger.info(f"Plugin {plugin.name} to be deleted, not found. Continuing...")
                return False
            raise OpenSearchPluginRemoveError(plugin.name)
        return True

+    def _clean_cache_if_needed(self):


nit: Not sure why this is needed. Clearing the cache on self.plugins will lead to ConfigExposedPlugins being read again, which is static, so no benefit to clearing. self.plugins is being used as a read_only property as far as I can see.

For _installed_plugins I would make that a @property instead, and just evaluate every time.

zmraul · 2024-12-03T09:41:02Z

lib/charms/opensearch/v0/opensearch_plugin_manager.py


    def run(self) -> bool:
        """Runs a check on each plugin: install, execute config changes or remove.

        This method should be called at config-changed event. Returns if needed restart.
        """
+        is_manager_ready = True


question: Shouldn't this method be gated by is_ready_for_api? It is calling API commands on apply and it's called from base-charm.

Maybe this is more a naming issue. The idea here was more is_keystore_ready.

Changed the name, check if that makes more sense.

Mehdi-Bendriss

Thanks Pedro. I left some comments.
It seems to me that we are making the plugin components too smart as opposed to the previous implementation, which I believe was clearer and had better separation of concerns.
The large deployments workflow needs to be carefully analyzed

lib/charms/opensearch/v0/models.py

Mehdi-Bendriss · 2024-12-03T23:09:20Z

lib/charms/opensearch/v0/models.py

+    protocol: Optional[str] = None
+    storage_class: Optional[str] = Field(alias="storage-class")
+    tls_ca_chain: Optional[str] = Field(alias="tls-ca-chain")
+    credentials: S3RelDataCredentials = Field(alias=S3_CREDENTIALS, default=S3RelDataCredentials())


can you explain the default here?

It covers the case where configs are missing from the s3-integrator.

Mehdi-Bendriss · 2024-12-03T23:11:52Z

lib/charms/opensearch/v0/opensearch_backups.py

@@ -220,48 +226,50 @@ def __init__(self, charm: "OpenSearchBaseCharm", relation_name: str = PeerCluste
        ]:
            self.framework.observe(event, self._on_s3_relation_action)

+    def _on_secret_changed(self, event: EventBase) -> None:


should be marked as abstract

Mehdi-Bendriss · 2024-12-03T23:12:14Z

lib/charms/opensearch/v0/opensearch_backups.py

+    @abstractmethod
+    def _on_s3_relation_broken(self, event: EventBase) -> None:
+        """Defers the s3 relation broken events."""
+        raise NotImplementedError


a pass or ... would be better here

Mehdi-Bendriss · 2024-12-03T23:13:47Z

lib/charms/opensearch/v0/opensearch_backups.py

+            # Defaults to True if we have a failure, to avoid any actions due to
+            # intermittent connection issues.
+            logger.warning(
+                "_is_restore_in_progress: failed to get indices status"
+                " - assuming restore is in progress"
+            )
+            return True


I'm not sure I understand this path: why return True and assume a restore is in progress when it may not be, instead of letting it crash?

Mehdi-Bendriss · 2024-12-04T00:07:35Z

lib/charms/opensearch/v0/opensearch_plugins.py

+    MANDATORY_CONFS = [
+        "bucket",
+        "endpoint",
+        "region",
+        "base_path",
+        "protocol",
+        "credentials",
+    ]


similar comment on duplication, pydantic should already perform the required validation.

Mehdi-Bendriss · 2024-12-04T00:08:03Z

lib/charms/opensearch/v0/opensearch_plugins.py

+        "protocol",
+        "credentials",
+    ]
+    DATA_PROVIDER = OpenSearchPluginBackupDataProvider


why a class variable?

They are very tightly coupled. The idea is to have each subclass define its own provider type. To simplify, I declare the class here, so the upstream code creating the plugin does not need to worry about creating two objects instead of one.
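
The pattern described in this reply can be sketched as below: each plugin subclass names its provider type in a class variable, and the base class instantiates it, so callers create a single object. All class names here are simplified placeholders rather than the PR's classes.

```python
from typing import Dict, Type


class DataProvider:
    """Hypothetical base provider; wraps whatever the charm exposes."""

    def __init__(self, charm: Dict[str, str]):
        self.charm = charm


class BackupDataProvider(DataProvider):
    """Hypothetical provider for the backup plugin."""


class Plugin:
    # Each subclass names its provider type; the base class instantiates it.
    DATA_PROVIDER: Type[DataProvider] = DataProvider

    def __init__(self, charm: Dict[str, str]):
        self.dp = self.DATA_PROVIDER(charm)


class BackupPlugin(Plugin):
    DATA_PROVIDER = BackupDataProvider
```

The trade-off raised in the review still applies: the plugin and its provider are coupled by construction, which is convenient for callers but makes the provider harder to swap.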

Mehdi-Bendriss · 2024-12-04T00:08:20Z

lib/charms/opensearch/v0/opensearch_plugins.py

    """

+    MODEL = S3RelData


why a class variable?

Mehdi-Bendriss · 2024-12-04T00:10:29Z

lib/charms/opensearch/v0/opensearch_relation_peer_cluster.py

+            self.charm.secrets.put_object(
+                Scope.APP,
+                S3_CREDENTIALS,
+                S3RelDataCredentials().to_dict(by_alias=True),
+            )



can you explain why?

This covers the case where we are missing the S3 information; we then create an empty object to signal that.

Mehdi-Bendriss · 2024-12-04T00:11:57Z

lib/charms/opensearch/v0/opensearch_plugins.py

                )
            )

+        if self.dp.is_main_orchestrator:


This is too smart for the plugins, which are supposed to be dumb. The heavy lifting should be done by the data provider, which should only inject into the plugin what it needs.

Yeah, I agree. But the thing here is that I need some "factory" method that differs between MAIN and OTHERS...

phvalguima added 30 commits June 14, 2024 18:12

Add support for settings API and avoid restarts as much as possible

c3bb8b0

comment unit tests for now

577eeee

Fixes for for plugin_manager

d85c5fb

Add handler class to manage the plugin relations

897f8a5

Fix if logic

c1b1c0e

Fix cached clean-up

0934290

Add fixes for test_plugins.py

66a5a76

Fix test_charms

b53b362

Fix check_plugin_manager_ready

eaf503f

Fixes for large deployments scenario

a904019

Updates following investigation on large deployments

8b3f15c

Move status.clear to be at the end of config_changed

ec15f2e

Possible large deployments relation fix

2408c88

Fix large deployments

61c8753

Convert S3 data struct to dict

3ef1db7

Convert S3 data struct to dict

74ebaa2

Add first batch of review changes

e2fffbd

Update status with set instead of clear(..., new_status) and move to …

7b9d488

…run_cmd on Keystore class

More review changes

43306b2

1st stage merge

4690454

First batch of changes to move away from _to_add/_to_del

fb90ce4

update unit tests and fix issues

69d1e20

Fixed unit tests for this change

3c326c2

Latest merge with main

dde3dd8

Update ci.yaml

dc3bceb

Update plugins and fix missing config exception

512dae2

fix lint

bc8ef5a

Merge remote-tracking branch 'origin/main' into DPE-4251-plugin-manag…

a34a09d

…er-change-before-started-set

Fix lint and unit tests

b11f1c6

Move to deployment_desc() instead

200cdfa

phvalguima added 2 commits September 13, 2024 11:46

Fix setup failure message to be shown in app level

8ccd54d

Add blocked as the leader unit gets blocked as well

b873ce5

phvalguima commented Sep 13, 2024

phvalguima added 5 commits September 13, 2024 14:24

Reset the comment position

c6bcc64

Merge remote-tracking branch 'origin' into DPE-4196-improve-plugin-ma…

2e0cb02

…nager

Fixes post-merge for unit test

7a9a17f

Remove the _request proxy method and fix unit tests post main merge

bc49845

fix lint

1733693

phvalguima commented Sep 13, 2024

phvalguima commented Sep 13, 2024

Add try/except catches for each request() call or its upstream method…

182b9f1

…s if they are internal

phvalguima commented Sep 14, 2024

Fix the return code for is_idle method

8324553

phvalguima marked this pull request as ready for review September 14, 2024 11:49

github-actions bot and others added 6 commits September 26, 2024 07:18

Sync docs from Discourse (#451)

a62f180

Sync charm docs from https://discourse.charmhub.io Co-authored-by: a-velasco <[email protected]>

[DPE-5558] Break CA rotation into integration test groups (#458)

c9edade

Currently, we are having a lot of time outs in CA rotation testing. Breaking between small and large deployments and having parallel runners will help with that overall duration.

fix unit tests for TLS in python 3.12

345999d

Rebase to 2/edge

5ad3cc5

Merge branch '2/edge' into DPE-4196-improve-plugin-manager

fb03837

Merge branch 'DPE-5602-tls-unit-tests-are-failing-when-executed-with-…

ec07a49

…python-3-12' into DPE-4196-improve-plugin-manager

Mehdi-Bendriss requested a review from zmraul October 31, 2024 16:07

zmraul reviewed Dec 3, 2024

Mehdi-Bendriss reviewed Dec 4, 2024

phvalguima added 4 commits December 7, 2024 09:35

Change name from to in plugin manager, remove unused methods and from…

23a1c24

… Keystore class

Remove defer() from the big peer-cluster event

0cb418d

[DPE-4196] Rebase the plugin refactor (#520)

561fa89

There were several changes on the config-changed logic and made this rebase rather complex. I am making a PR to the main refactor branch so we can look at it more carefully before having it all together.

Merge remote-tracking branch 'origin/2/edge' into DPE-4196-improve-pl…

c718443

…ugin-manager

phvalguima changed the base branch from main to 2/edge December 13, 2024 16:01

Add new alert rules for throttling (#509)

77d9121

- If OpenSearch is throttling, this is an alert that optimizations are necessary like scaling the number of nodes or changing queries and indexing patterns
