Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-3573][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins #177

Merged
merged 1 commit into from
Mar 1, 2024

Conversation

phvalguima
Copy link
Contributor

@phvalguima phvalguima commented Feb 6, 2024

Adds a GET /_cluster/settings, which returns the configuration of the cluster at the moment, including defaults. This way, plugins can check if they should add their configs and request a restart during the start cycle of the charm.

The _is_enabled is also extended to use the cluster settings instead of checking the configuration files directly.

Moved OpenSearchPluginConfig to extend pydantic.BaseModel. It ensures that configuration is translated to Dict[str, str], in special the boolean value, that must have a format of: true or false.

Also closes DPE-3579: move the plugin settings status to use the charm's status.{set,clear} APIs.

@phvalguima phvalguima marked this pull request as ready for review February 6, 2024 00:09
@phvalguima phvalguima changed the title [DPE-3352] last passing tests [DPE-3352] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins Feb 6, 2024
@phvalguima phvalguima marked this pull request as draft February 12, 2024 06:59
@Mehdi-Bendriss
Copy link
Contributor

Mehdi-Bendriss commented Feb 12, 2024

It will be hard to track any side effects introduced by these changes, let's split this PR in 4 separate PRs please:

  1. plugins: rely on _cluster/settings instead of opensearch.yml for plugins:
    • including: prevent restarting randomly at the start of the cluster
    • PR-175 depends on / waits for this fix
    • some tests on the data integrator fail because of it
  2. experiment with synchronous start: this should only be about making the _start_opensearch synchronous (we haven't confirmed this is solving the issue and the impact it has):
    • no deferrals
    • increase of the timeout duration on self.opensearch.start
    • managing the OpenSearchStartTimeoutError, OpenSearchNotFullyReadyError - how to handle in a synchronous scenario?
  3. the restart part (i'm not entirely sure why we need it?)
  4. the rollingops part should be out of the charm - and if changes need to be made, they should be done upstream

I have a couple of questions:

  • why do we need to go through the restart method for anything start related? it seems we are complicating the logic of the restart by making it a single point of entry. (please refer to PR-175 for ideas)
  • I'm not sure I understand what we're trying to solve in the changes in the _on_start hook? I re-read the previous logic, and it seems it was correct as it was prior to the new changes

@phvalguima
Copy link
Contributor Author

phvalguima commented Feb 12, 2024

To clarify: start logic for me is all the logic directly involved in getting the service up and running for the very first time.

Regarding the questions:

why do we need to go through the restart method for anything start related? it seems we are complicating the logic of the restart by making it a single point of entry. (please refer to PR-175 for ideas)

There are some reasons:

  1. The charm-rolling-ops never checks directly for the callback method in the event itself. It sets the callback_override in the relation and then, in a next RunWithLockEvent, it searches in the relation databag for the override. So, if we switch methods, we may have:
  • Hook 1, Event 1: Unit runs: acquire_lock.emit(callback_override=_start_opensearch)
  • Hook 1, Event 1: Unit finishes Event 1
    ...
  • Hook 1, Event X: Unit runs: acquire_lock.emit(callback_override=_restart_opensearch) <--- we have overriden the value in the relation databag
    ...
  • Hook N, Event 1: Unit received "RunWithLockEvent" and recovers the callback: _restart_opensearch is called
    There is a potential race condition here
  1. The start logic is still being executed several hooks later, for each unit. That increases the chance of overlapping _start_opensearch and _restart_opensearch

I'm not sure I understand what we're trying to solve in the changes in the _on_start hook? I re-read the previous logic, and it seems it was correct as it was prior to the new changes

For some reasons: #154 - I find the start logic in the charm has a lot going on. It needs information that only comes later, like certificates. That means we have some "mixing", where certain events of the lifecycle have to happen even before the actual start. That means: we need a way to defer.
The first step: the _on_start gets more of the logic, which also cleans up the _start_opensearch and allows us to have one single callback_override: _restart_opensearch. That first simplification moves some of the defer up to the _on_start. Then, I still have to deal, as you said, with delayed connections. You can see I added tenacity and I have several retrials in the start/stop. Now, these are synchronous.
I cannot make the tests work 100% with simply removing the extra "restart" caused by plugin-manager originally - it randomly fails by never receiving the lock.


Regarding your points:

  1. plugins: repurposed the PR back to focus only in the plugin

  2. experiment with sync start: yes, that was my first trial. But then I noticed it did not resolve the problem - our starting cycle has several stages and we need some way to defer the start until we are ready. That is where I started to work in the RollingOpsManager instead. I am trying to have a retriable locking system.

  3. As mentioned above, I do not want to accidentally mix _start_opensearch and _restart_opensearch. Given a start is just a "restart without stopping", I see no reason to merge both methods.

  4. Ack, done in: charm-rolling-ops#12. That was my 1st iteration on the problem. I will prepare a PR using OpenSearch for the testing. Re:PR#175, if you check the rolling-ops change, I think we should extend "WorkloadLockManager" and overload "can_node_be_safe_stopped", where it would check the output of "wait_for_shards_relocation" to see if it is safe to stop that node.

@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch from 060072f to 2f9fed7 Compare February 12, 2024 19:29
@phvalguima phvalguima marked this pull request as ready for review February 17, 2024 06:27
@phvalguima
Copy link
Contributor Author

The backup error in: https://github.com/canonical/opensearch-operator/actions/runs/7951972946/job/21713806527, is a status cleanup addressed in #186

@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch from 22cb1a8 to c938281 Compare February 19, 2024 10:06
@phvalguima phvalguima changed the base branch from main to gcs-DPE-3578-clear-status-backup February 19, 2024 10:12
@phvalguima phvalguima changed the base branch from gcs-DPE-3578-clear-status-backup to DPE-3578-clear-status-backup February 19, 2024 10:14
juditnovak
juditnovak previously approved these changes Feb 19, 2024
lib/charms/opensearch/v0/opensearch_relation_provider.py Outdated Show resolved Hide resolved
@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch from 670fd98 to 9949059 Compare February 19, 2024 14:18
@phvalguima
Copy link
Contributor Author

The h-scaling-integration failed with a time out:

2024-02-19T19:14:36.5332986Z �[32mINFO    �[0m tests.integration.helpers_deployments:helpers_deployments.py:271 	App: opensearch - Units - conditions unmet.
2024-02-19T19:14:36.5334811Z �[32mINFO    �[0m tests.integration.helpers_deployments:helpers_deployments.py:272 
2024-02-19T19:14:36.5335427Z 	app: opensearch active -- message: None
2024-02-19T19:14:36.5335956Z 		opensearch-3  -- (10.171.13.159) -- [idle (since: 18:44:39)] active: 
2024-02-19T19:14:36.5336557Z 		opensearch-5  -- (10.171.13.194) -- [idle (since: 18:44:39)] active: 
2024-02-19T19:14:36.5337257Z 		opensearch-6  -- (10.171.13.245) -- [idle (since: 18:44:37)] waiting: Awaiting service operation
2024-02-19T19:14:36.5337962Z 		opensearch-7  -- (10.171.13.183) -- [idle (since: 18:44:39)] active: 
2024-02-19T19:14:36.5338550Z 		opensearch-8* -- (10.171.13.236) -- [idle (since: 18:44:48)] active: 
2024-02-19T19:14:36.5338885Z 
2024-02-19T19:14:36.5339362Z �[1m�[31mERROR   �[0m tests.integration.helpers_deployments:helpers_deployments.py:346 wait_until -- Timed out!

@phvalguima phvalguima changed the title [DPE-3352] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins [DPE-3352][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins Feb 20, 2024
@phvalguima phvalguima changed the title [DPE-3352][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins [DPE-3353][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins Feb 21, 2024
@phvalguima phvalguima changed the title [DPE-3353][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins [DPE-3573][DPE-3579] Fix #164 - check _cluster/settings instead of opensearch.yml for plugins Feb 21, 2024
Copy link
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pedro - I left some questions and comments

lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_base_charm.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/helper_cluster.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
lib/charms/opensearch/v0/opensearch_plugin_manager.py Outdated Show resolved Hide resolved
@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch 2 times, most recently from 7114625 to 2f7dd8d Compare February 23, 2024 10:18
@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch from 27d7db4 to 5331546 Compare February 29, 2024 03:17
Base automatically changed from DPE-3578-clear-status-backup to main February 29, 2024 08:35
@phvalguima phvalguima dismissed juditnovak’s stale review February 29, 2024 08:35

The base branch was changed.

@phvalguima phvalguima force-pushed the DPE-3352-last-passing-tests branch from a21604b to cfdc018 Compare February 29, 2024 11:03
@phvalguima phvalguima merged commit f1d9d37 into main Mar 1, 2024
15 of 17 checks passed
@phvalguima phvalguima deleted the DPE-3352-last-passing-tests branch March 1, 2024 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants