
Large Deployments - Failover fails to initialize the security index #518

Open
skourta opened this issue Dec 5, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@skourta
Contributor

skourta commented Dec 5, 2024

Steps to reproduce

1. Deploy a large deployment:
   - Main App: `cluster_manager` role only
   - Failover: `cluster_manager` and `data`
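The deployment above can be sketched with Juju commands like the following. This is only a sketch: the `roles`, `init_hold`, and `cluster_name` config options and the `peer-cluster`/`peer-cluster-orchestrator` endpoint names are assumptions based on the charm's large-deployment setup and may differ from the actual charm interface.

```shell
# Sketch of the large-deployment reproduction (assumed option/endpoint names).
# Main orchestrator: cluster_manager role only.
juju deploy opensearch main --channel 2/edge \
  --config cluster_name=app --config roles=cluster_manager

# Failover orchestrator: cluster_manager and data roles, held until related.
juju deploy opensearch failover --channel 2/edge \
  --config cluster_name=app --config init_hold=true \
  --config roles=cluster_manager,data

# Wire the failover app to the main orchestrator.
juju integrate main:peer-cluster-orchestrator failover:peer-cluster
```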

Expected behavior

The large deployment boots up correctly.

Actual behavior

Failover is stuck at initializing the security index.
[screenshot: failover unit stuck at initializing the security index]

Versions

Operating system: Ubuntu 24.04.1 LTS

Juju CLI: 3.6.0-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 2/edge branch

LXD: 5.21.2 LTS

@skourta skourta added the bug Something isn't working label Dec 5, 2024
@reneradoi
Contributor

I assume this issue happens because the failover unit has itself configured in the unicast_host file (as a cluster manager), but does not get initial_cluster_manager_nodes configured for itself:

```
[opensearch-failover-2.890] cluster-manager not discovered yet, this node has not previously joined a bootstrapped cluster, and [cluster.initial_cluster_manager_nodes] is empty on this node: have discovered [{opensearch-failover-2.890}{IHiqih9TSdysSYpXudpl3A}{VfR2BJfCTuOiiUSSjFZeeA}{10.54.237.136}{10.54.237.136:9300}{dm}{shard_indexing_pressure_enabled=true, app_id=ecd37465-df44-4398-866f-ec3a6877af2d/opensearch-failover}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305] from hosts providers and [{opensearch-failover-2.890}{IHiqih9TSdysSYpXudpl3A}{VfR2BJfCTuOiiUSSjFZeeA}{10.54.237.136}{10.54.237.136:9300}{dm}{shard_indexing_pressure_enabled=true, app_id=ecd37465-df44-4398-866f-ec3a6877af2d/opensearch-failover}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
```
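For context, the log says the node never received a bootstrap list. On a node that is allowed to bootstrap, `opensearch.yml` would carry something like the following (illustrative values taken from the log above, not the charm's actual rendered config):

```yaml
node.roles: [cluster_manager, data]
cluster.initial_cluster_manager_nodes:
  - opensearch-failover-2.890
```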

This is configured in OpenSearchConfig.set_node() here: a unit only gets initial_cluster_manager_nodes if it has the cluster_manager role (which this unit does) and contributes to bootstrap (which it does not, because only MAIN_ORCHESTRATORS do, see here).

In order to solve this issue, a decision needs to be made on:
a) whether this kind of deployment setup is considered valid (Main App: cluster_manager role only; Failover: cluster_manager and data) and, if so,
b) whether FAILOVER_ORCHESTRATORS should also contribute to bootstrapping in this kind of deployment.
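The gating described above can be sketched as follows. This is a minimal model, not the charm's actual code: the function name, the `contribute_failover` flag, and the orchestrator-type strings are all invented for illustration; only the rule (cluster_manager role AND bootstrap contribution, with only MAIN orchestrators contributing today) comes from the comment above.

```python
# Hypothetical sketch of the bootstrap-contribution rule described above.
# A unit writes cluster.initial_cluster_manager_nodes only when it both
# holds the cluster_manager role AND contributes to bootstrap. Today only
# MAIN orchestrators contribute, so the failover unit gets an empty list.

MAIN = "main"
FAILOVER = "failover"

def initial_cluster_manager_nodes(roles, orchestrator, node_name,
                                  contribute_failover=False):
    """Return the bootstrap node list a unit would write to opensearch.yml."""
    contributes = orchestrator == MAIN or (
        contribute_failover and orchestrator == FAILOVER
    )
    if "cluster_manager" in roles and contributes:
        return [node_name]
    return []

# Reproduces the bug: failover unit with cluster_manager role, no bootstrap list.
assert initial_cluster_manager_nodes(
    ["cluster_manager", "data"], FAILOVER, "opensearch-failover-2") == []

# Option (b) above: letting FAILOVER orchestrators contribute fixes it.
assert initial_cluster_manager_nodes(
    ["cluster_manager", "data"], FAILOVER, "opensearch-failover-2",
    contribute_failover=True) == ["opensearch-failover-2"]
```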


Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6158.

This message was autogenerated
