Ensure OpenSearch lock is written to all nodes before using #263

Merged
carlcsaposs-canonical merged 2 commits into main from wait-for-all-shards
Apr 29, 2024

Conversation

carlcsaposs-canonical
Contributor

@carlcsaposs-canonical carlcsaposs-canonical commented Apr 25, 2024

Checking that the OpenSearch lock has been replicated to all nodes should avoid the following edge case: we have 2+ online nodes, then one node suffers a network cut and sees only 1 online node. Because the lock has not been replicated to that node, it assumes no unit holds the OpenSearch lock and requests the peer databag lock.

No effect on #243 (originally marked as potentially fixing it)

@Mehdi-Bendriss reproduced the issue in #243 and found:

  • unit 0 was cluster manager
  • unit 1 (the one scaling down) had primary shard

This means the issue should not be related to a failing cluster manager election when scaling from 2 to 1 units, so most likely the lock document was not replicated to unit 0.

Context: https://chat.canonical.com/canonical/pl/oc797xcddpn53giu6gtfp4sboo
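
A minimal sketch of the replication check described above, not the charm's actual code. It assumes the lock is written as a document and that the write response carries OpenSearch's standard `_shards` summary (`total`, `successful`, `failed`). The function name `write_reached_all_copies` and the `expected_copies` parameter are hypothetical.

```python
def write_reached_all_copies(response: dict, expected_copies: int) -> bool:
    """Return True only if the lock write succeeded on every shard copy.

    `response` is the JSON body returned by an index/create request.
    `expected_copies` is the number of shard copies we expect (assumed here
    to be one per online node; that mapping is an assumption of this sketch).
    """
    shards = response.get("_shards", {})
    successful = shards.get("successful", 0)
    return (
        shards.get("failed", 0) == 0
        # Every copy the write was supposed to reach acknowledged it
        and successful == shards.get("total", -1)
        # And there are at least as many copies as we expect
        and successful >= expected_copies
    )
```

If this returns False, the unit should not treat the lock as safely held, because a node cut off from the others might not see the lock document.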

@carlcsaposs-canonical carlcsaposs-canonical merged commit c6d0703 into main Apr 29, 2024
23 of 26 checks passed
@carlcsaposs-canonical carlcsaposs-canonical deleted the wait-for-all-shards branch April 29, 2024 12:43
carlcsaposs-canonical added a commit that referenced this pull request Apr 30, 2024
## Issue
Fixes #263

## Solution
Check whether the lock exists instead of trying to create the lock to find out whether it exists

When creating the lock, we wait for all shards, which cannot be satisfied if a unit is offline after it has acquired the lock (e.g. for a restart)

This change also requires that we delete the lock if, when we create it, the write doesn't go through on all nodes
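
The flow in this commit message could be sketched as below. This is an illustration under stated assumptions, not the commit's implementation: `FakeLockStore` is a hypothetical in-memory stand-in for the lock index, and `acquire` is a hypothetical helper name. The only detail taken from OpenSearch itself is the shape of the `_shards` summary in a write response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLockStore:
    """In-memory stand-in for the lock index (hypothetical, illustration only)."""

    total_copies: int = 2      # shard copies a write should reach
    reachable_copies: int = 2  # copies that actually accept the write
    holder: Optional[str] = None

    def get_holder(self) -> Optional[str]:
        # Plain read: does not depend on all shard copies being active
        return self.holder

    def create(self, unit: str) -> dict:
        if self.holder is not None:
            raise RuntimeError("lock document already exists")
        self.holder = unit
        return {
            "_shards": {
                "total": self.total_copies,
                "successful": self.reachable_copies,
                "failed": 0,
            }
        }

    def delete(self) -> None:
        self.holder = None


def acquire(store: FakeLockStore, unit: str) -> bool:
    """Return True if `unit` holds the lock after this call."""
    # 1. Check whether the lock exists instead of creating it to find out
    holder = store.get_holder()
    if holder is not None:
        return holder == unit
    # 2. No lock yet: create it and inspect how many copies were written
    shards = store.create(unit)["_shards"]
    if shards["successful"] < shards["total"]:
        # 3. Write did not reach every copy: remove the partial lock
        store.delete()
        return False
    return True
```

With this shape, a unit that already holds the lock and has a peer offline still sees itself as the holder, because the existence check is a read and does not wait for all shards.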
carlcsaposs-canonical added a commit that referenced this pull request May 2, 2024
## Issue
Fixes #263

## Solution
Check whether the lock exists instead of trying to create the lock to
find out whether it exists

When creating the lock, we wait for all shards, which cannot be satisfied
if a unit is offline after it has acquired the lock (e.g. for a restart)

This change also requires that we delete the lock if, when we create it,
the write doesn't go through on all nodes