Scaling 3 -> 0 -> 3 results in cluster stuck waiting for TLS / restart #496

Open
phvalguima opened this issue Oct 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@phvalguima
Contributor

There appears to be a functional difference between removing the application and redeploying it, versus scaling it from 3 -> 0 -> 3 units.

After scaling 3 -> 0 -> 3, the cluster is stuck (re)initializing:

$ juju status
Model       Controller        Cloud/Region      Version  SLA          Timestamp
opensearch  azure-westeurope  azure/westeurope  3.4.4    unsupported  11:21:15Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         active       3  opensearch                2/edge         185  no       
opensearch-dashboards              blocked      1  opensearch-dashboards     2/stable        22  no       Opensearch service is (partially or fully) down
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no       
sysconfig                          active       3  sysconfig                 latest/stable   33  no       ready

Unit                         Workload     Agent      Machine  Public address  Ports     Message
opensearch-dashboards/0*     blocked      idle       1        172.18.0.14     5601/tcp  Opensearch service is (partially or fully) down
opensearch/6*                waiting      executing  8        172.18.0.19               Waiting for OpenSearch to start...
  sysconfig/67               active       idle                172.18.0.19               ready
opensearch/7                 maintenance  executing  9        172.18.0.20               Waiting for TLS to be fully configured...
  sysconfig/68               active       idle                172.18.0.20               ready
opensearch/8                 maintenance  executing  10       172.18.0.18               Waiting for TLS to be fully configured...
  sysconfig/66*              active       idle                172.18.0.18               ready
self-signed-certificates/0*  active       idle       0        172.18.0.15               

Machine  State    Address      Inst id             Base          AZ  Message
0        started  172.18.0.15  juju-a49dc1-0       [email protected]      
1        started  172.18.0.14  juju-a49dc1-1       [email protected]      
8        started  172.18.0.19  manual:172.18.0.19  [email protected]      Manually provisioned machine
9        started  172.18.0.20  manual:172.18.0.20  [email protected]      Manually provisioned machine
10       started  172.18.0.18  manual:172.18.0.18  [email protected]      Manually provisioned machine

In the latter case (scaling 3 -> 0 -> 3), the app-level databag still contains stale data from the three original units that were removed. For example, this show-unit output (https://pastebin.canonical.com/p/VC7vCvvPSN/) still shows data from units that are gone, such as opensearch/0:

...

  - relation-id: 2
    endpoint: opensearch-peers
    related-endpoint: opensearch-peers
    application-data:
      admin_user_initialized: "True"
      allocation-exclusions-to-delete: opensearch-4.715,opensearch-1.715
      bootstrap_contributors_count: "3"
      bootstrapped: "True"
      client_relation_users: '{}'
      delete-voting-exclusions: opensearch-4.715,opensearch-1.715
      deployment-description: '{"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "config": {"cluster_name": "opensearch-hucz", "data_temperature":
        null, "init_hold": false, "profile": "production", "roles": []}, "pending_directives":
        [], "promotion_time": 1729778096.196239, "start": "start-with-generated-roles",
        "state": {"message": "", "value": "active"}, "typ": "main-orchestrator"}'
      nodes_config: '{"opensearch-0.715": {"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "ip": "172.18.0.12", "name": "opensearch-0.715", "roles":
        ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number":
        0}, "opensearch-2.715": {"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "ip": "172.18.0.13", "name": "opensearch-2.715", "roles":
        ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number":
        2}}'
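
To make the stale entries easier to spot, here is a minimal diagnostic sketch in Python. It is not part of the charm: `find_stale_nodes` is a hypothetical helper, and the input is a trimmed-down copy of the `nodes_config` value above (only `ip` and `unit_number` are kept). It assumes the live unit numbers are read off `juju status` by hand, here 6, 7 and 8.

```python
import json


def find_stale_nodes(nodes_config_json, live_unit_numbers):
    """Return the node names whose unit_number is not among the live units.

    Hypothetical helper for illustration only; the charm may track
    membership differently.
    """
    nodes = json.loads(nodes_config_json)
    live = set(live_unit_numbers)
    return sorted(
        name
        for name, cfg in nodes.items()
        if cfg.get("unit_number") not in live
    )


# Trimmed-down copy of the nodes_config value from the peer databag.
nodes_config = json.dumps(
    {
        "opensearch-0.715": {"ip": "172.18.0.12", "unit_number": 0},
        "opensearch-2.715": {"ip": "172.18.0.13", "unit_number": 2},
    }
)

# Units currently listed by `juju status`: opensearch/6, /7 and /8.
stale = find_stale_nodes(nodes_config, [6, 7, 8])
print(stale)  # ['opensearch-0.715', 'opensearch-2.715']
```

Both entries in `nodes_config` belong to units that no longer exist, which is consistent with the databag not being cleaned up when the application is scaled to 0.
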
@phvalguima phvalguima added the bug Something isn't working label Oct 30, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5758.

This message was autogenerated
