
Cluster fails to recover from loss of quorum #358

Open
carlcsaposs-canonical opened this issue Jan 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@carlcsaposs-canonical (Contributor)

Steps to reproduce

  1. Deploy 3 units of mysql-k8s from stable
  2. (optional) Relate to mysql-router-k8s from mysql-router-k8s-operator#190 ("Retry if MySQL Server is unreachable")
  3. Run the following (if unit 0 is the primary); a consolidated script follows this list
>>> import subprocess, time
>>> while True:
...     for pod in (1, 2):
...             subprocess.run(f"kubectl -n foo2 delete pod mysql-k8s-{pod} --force".split())
...     time.sleep(5)
  4. Ctrl-C to break the while loop
  5. Wait, run jhack ffwd, wait; the server does not recover
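
For convenience, the loop from step 3 as a self-contained script (a sketch; it assumes namespace foo2, unit 0 as the primary, and kubectl on PATH):

#!/usr/bin/env python3
"""Repeatedly force-delete the replica pods to break group quorum."""
import subprocess
import time

NAMESPACE = "foo2"  # assumption: the model/namespace used in this report

try:
    while True:
        for pod in (1, 2):  # units 1 and 2; unit 0 is assumed primary
            subprocess.run(
                f"kubectl -n {NAMESPACE} delete pod mysql-k8s-{pod} --force".split()
            )
        time.sleep(5)
except KeyboardInterrupt:
    pass  # step 4: ctrl-c to break the loop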

Expected behavior

Server recovers from loss of quorum

Actual behavior

Server stays stuck without quorum

Versions

Operating system: Ubuntu 22.04

Juju CLI: 3.1.7-genericlinux-amd64

Juju agent: 3.1.7

Charm revision: 113

MicroK8s: v1.28.3 revision 6091

Log output

Juju debug log:
no-quorum-stuck-debug-log.txt

unit-mysql-k8s-0: 10:06:33 WARNING unit.mysql-k8s/0.juju-log Failed to get cluster primary addresses
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 666, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1441, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-01-19T10:06:33Z: Loading startup files...\nverbose: 2024-01-19T10:06:33Z: Loading plugins...\nverbose: 2024-01-19T10:06:33Z: Connecting to MySQL at: clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local\nverbose: 2024-01-19T10:06:33Z: Shell.connect_to_primary: tid=5730: CONNECTED: mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local\nverbose: 2024-01-19T10:06:33Z: Redirecting session from \'mysqlx://clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local:33060\' to a PRIMARY of an InnoDB cluster or ReplicaSet...\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.Error: Shell Error (51011): Shell.connect_to_primary: The InnoDB cluster appears to be under a partial or total outage and an ONLINE PRIMARY cannot be selected. (Group has no quorum)\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/lib/charms/mysql/v0/mysql.py", line 1767, in get_cluster_primary_address
    output = self._run_mysqlsh_script("\n".join(get_cluster_primary_commands))
  File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 669, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-01-19T10:06:33Z: Loading startup files...
verbose: 2024-01-19T10:06:33Z: Loading plugins...
verbose: 2024-01-19T10:06:33Z: Connecting to MySQL at: clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local
verbose: 2024-01-19T10:06:33Z: Shell.connect_to_primary: tid=5730: CONNECTED: mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local
verbose: 2024-01-19T10:06:33Z: Redirecting session from 'mysqlx://clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local:33060' to a PRIMARY of an InnoDB cluster or ReplicaSet...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.Error: Shell Error (51011): Shell.connect_to_primary: The InnoDB cluster appears to be under a partial or total outage and an ONLINE PRIMARY cannot be selected. (Group has no quorum)
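
The traceback shows the failure path: _run_mysqlsh_script runs the script over Pebble and re-raises any ops.pebble.ExecError as MySQLClientError(e.stderr), so get_cluster_primary_address sees the quorum error (Shell Error 51011) only as stderr text. A minimal sketch of that pattern, reconstructed from the traceback alone (the container handle and the push step are assumptions; the command line is copied from the log above):

from ops.pebble import ExecError

class MySQLClientError(Exception):
    """Raised when mysqlsh exits non-zero (mysql_k8s_helpers.py:669)."""

def _run_mysqlsh_script(container, script: str) -> str:
    # Assumption: the script is pushed into the workload container first;
    # the path /tmp/script.py and the command line come from the log above.
    container.push("/tmp/script.py", script)
    process = container.exec(
        ["/usr/bin/mysqlsh", "--no-wizard", "--python", "--verbose=1",
         "-f", "/tmp/script.py", ";", "rm", "/tmp/script.py"]
    )
    try:
        stdout, _ = process.wait_output()
        return stdout
    except ExecError as e:
        # The quorum error survives only as text in e.stderr; the exit
        # code and structured exception type are flattened here.
        raise MySQLClientError(e.stderr)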

Additional context

If all pods are deleted (including the primary), the server usually recovers.
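
For reference, MySQL Shell's AdminAPI documents manual escape hatches for this state: force_quorum_using_partition_of() for a partial outage and reboot_cluster_from_complete_outage() for a total one. A hedged sketch (run in mysqlsh --python on a surviving member; the clusteradmin account and endpoint hostname are taken from the log above, and this is not what the charm currently automates):

# Inside `mysqlsh --python`, connected as
# clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local

# Partial outage: restore quorum from the reachable partition.
cluster = dba.get_cluster()  # `dba` is a mysqlsh global object
cluster.force_quorum_using_partition_of(
    "clusteradmin@mysql-k8s-0.mysql-k8s-endpoints.foo2.svc.cluster.local:3306"
)

# Total outage: no member is ONLINE at all.
cluster = dba.reboot_cluster_from_complete_outage()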

@carlcsaposs-canonical carlcsaposs-canonical added the bug Something isn't working label Jan 19, 2024
