
Mysql is failing to get endpoints from cluster status #534

Open
natalytvinova opened this issue Nov 29, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@natalytvinova

Steps to reproduce

  1. We have not yet found a way to reproduce it. Please see the attached bundle: bundle.yaml.txt

Expected behavior

MySQL is able to get the cluster endpoints.

Actual behavior

MySQL fails to get endpoints from the cluster status:

unit-kfp-db-0: 01:36:15 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster status for kfp-db-cluster
unit-kfp-db-0: 01:36:15 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster endpoints
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/src/mysql_k8s_helpers.py", line 786, in update_endpoints
    rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 724, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/mysql/v0/mysql.py", line 1872, in get_cluster_endpoints
    raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-kfp-db-0: 01:36:16 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:36:17 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster status for kfp-db-cluster
unit-kfp-db-0: 01:36:17 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster endpoints
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/src/mysql_k8s_helpers.py", line 786, in update_endpoints
    rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 724, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/mysql/v0/mysql.py", line 1872, in get_cluster_endpoints
    raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-kfp-db-0: 01:36:18 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:36:19 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster status for kfp-db-cluster
unit-kfp-db-0: 01:36:19 ERROR unit.kfp-db/0.juju-log database-peers:6: Failed to get cluster endpoints
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/src/mysql_k8s_helpers.py", line 786, in update_endpoints
    rw_endpoints, ro_endpoints, offline = self.get_cluster_endpoints(get_ips=False)
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 724, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-kfp-db-0/charm/lib/charms/mysql/v0/mysql.py", line 1872, in get_cluster_endpoints
    raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
charms.mysql.v0.mysql.MySQLGetClusterEndpointsError: Failed to get endpoints from cluster status
unit-kfp-db-0: 01:36:19 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:40:17 INFO unit.kfp-db/0.juju-log Unit workload member-state is online with member-role secondary
unit-kfp-db-0: 01:40:35 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:40:37 INFO unit.kfp-db/0.juju-log database-peers:6: Starting the log rotate manager
unit-kfp-db-0: 01:40:37 INFO unit.kfp-db/0.juju-log database-peers:6: Started log rotate manager process with PID 1129
unit-kfp-db-0: 01:40:39 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:40:43 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:44:24 INFO unit.kfp-db/0.juju-log Unit workload member-state is online with member-role secondary
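
For context, below is a minimal sketch of the call path the traceback describes. This is a simplified reconstruction, not the charm's actual code: only update_endpoints(), get_cluster_endpoints(get_ips=False) and MySQLGetClusterEndpointsError appear verbatim in the logs, and get_cluster_status() returning an empty result when the query fails is an assumption based on the "Failed to get cluster status for kfp-db-cluster" line that precedes each error.

# Simplified reconstruction of the failing call path (not the charm's real code).

class MySQLGetClusterEndpointsError(Exception):
    """Raised when endpoints cannot be derived from the cluster status."""


def get_cluster_status():
    # Stand-in for the cluster status query; assumed to return None when the
    # query fails, matching the "Failed to get cluster status for
    # kfp-db-cluster" message logged just before each endpoint error.
    return None


def get_cluster_endpoints(get_ips=False):
    # lib/charms/mysql/v0/mysql.py, get_cluster_endpoints (traceback line 1872)
    status = get_cluster_status()
    if not status:
        raise MySQLGetClusterEndpointsError("Failed to get endpoints from cluster status")
    # otherwise the members would be split into read-write / read-only / offline sets
    return set(), set(), set()


def update_endpoints():
    # src/mysql_k8s_helpers.py, update_endpoints (traceback line 786)
    rw_endpoints, ro_endpoints, offline = get_cluster_endpoints(get_ips=False)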

Versions

Operating system: Ubuntu 22.04.5 LTS
Juju CLI: 3.5.4-genericlinux-amd64
Juju agent: 3.5.4
Charm revision: 8.0/stable rev 180
Kubernetes version: 1.29.9

Log output

After the issue appeared three times we enabled debug-log, but so far we have not been able to reproduce it with debug-log enabled.
There are also no pod restarts:

$ kubectl get po -n kubeflow
NAME                                             READY   STATUS    RESTARTS   AGE
admission-webhook-0                              2/2     Running   0          3d15h
argo-controller-0                                2/2     Running   0          3d15h
dex-auth-0                                       2/2     Running   0          3d15h
envoy-0                                          2/2     Running   0          3d15h
grafana-agent-k8s-0                              2/2     Running   0          3d15h
istio-ingressgateway-0                           1/1     Running   0          3d15h
istio-ingressgateway-workload-7cbbfc6679-2t24b   1/1     Running   0          3d16h
istio-ingressgateway-workload-7cbbfc6679-nqrdr   1/1     Running   0          3d16h
istio-ingressgateway-workload-7cbbfc6679-rgnld   1/1     Running   0          3d15h
istio-pilot-0                                    1/1     Running   0          3d15h
istiod-6bc5bc58b4-24g4j                          1/1     Running   0          3d16h
jupyter-controller-0                             2/2     Running   0          3d15h
jupyter-ui-0                                     2/2     Running   0          3d15h
katib-controller-0                               2/2     Running   0          3d15h
katib-db-0                                       2/2     Running   0          3d15h
katib-db-1                                       2/2     Running   0          3d16h
katib-db-2                                       2/2     Running   0          3d16h
katib-db-manager-0                               2/2     Running   0          3d16h
katib-ui-0                                       2/2     Running   0          3d16h
kfp-api-0                                        2/2     Running   0          3d15h
kfp-db-0                                         2/2     Running   0          3d12h
kfp-db-1                                         2/2     Running   0          3d16h
kfp-db-2                                         2/2     Running   0          3d12h
kfp-metadata-writer-0                            2/2     Running   0          3d15h
kfp-persistence-0                                2/2     Running   0          3d15h
kfp-profile-controller-0                         2/2     Running   0          3d15h
kfp-schedwf-0                                    2/2     Running   0          3d15h
kfp-ui-0                                         2/2     Running   0          3d15h
kfp-viewer-0                                     2/2     Running   0          3d15h
kfp-viz-0                                        2/2     Running   0          3d15h
knative-eventing-0                               1/1     Running   0          3d15h
knative-operator-0                               3/3     Running   0          3d15h
knative-serving-0                                1/1     Running   0          3d15h
kserve-controller-0                              3/3     Running   0          3d15h
kubeflow-dashboard-0                             2/2     Running   0          3d15h
kubeflow-profiles-0                              3/3     Running   0          3d15h
kubeflow-roles-0                                 1/1     Running   0          3d15h
kubeflow-volumes-0                               2/2     Running   0          3d15h
metacontroller-operator-0                        1/1     Running   0          3d15h
metacontroller-operator-charm-0                  1/1     Running   0          3d15h
minio-0                                          1/1     Running   0          3d15h
minio-operator-0                                 1/1     Running   0          3d15h
mlflow-minio-0                                   1/1     Running   0          3d15h
mlflow-minio-operator-0                          1/1     Running   0          3d15h
mlflow-mysql-0                                   2/2     Running   0          3d12h
mlflow-mysql-1                                   2/2     Running   0          3d12h
mlflow-mysql-2                                   2/2     Running   0          3d12h
mlflow-server-0                                  3/3     Running   0          3d15h
mlmd-0                                           2/2     Running   0          3d15h
modeloperator-6fc7f5477b-6rkxm                   1/1     Running   0          3d15h
oidc-gatekeeper-0                                2/2     Running   0          3d15h
pvcviewer-operator-0                             2/2     Running   0          3d15h
resource-dispatcher-0                            2/2     Running   0          3d15h
tensorboard-controller-0                         2/2     Running   0          3d15h
tensorboards-web-app-0                           2/2     Running   0          3d15h
training-operator-0                              1/1     Running   0          3d15h
training-operator-7d6446b8c-zcg8v                1/1     Running   0          3d15h

Additional context

This environment is running on an Azure AKS cluster. We have two identical clusters deployed and the issue only happens on one of them, and it affects not only kfp-db but also katib-db.

The workaround to get the cluster member back into a healthy state is to delete the affected pod, as shown below.
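
A minimal example of the workaround, assuming the affected unit is kfp-db/0 in the kubeflow namespace as in the logs above; the StatefulSet recreates the pod:

$ kubectl delete pod kfp-db-0 -n kubeflow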

@natalytvinova added the bug label on Nov 29, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6128.

This message was autogenerated

@sagittariuslee

From COS (Canonical Observability Stack), we received the alert:

100% of the juju_kubeflow_bbbc1592_katib-db_prometheus_scrape-0/ targets in namespace are down.

The alert self-resolved after 7 hours and 25 minutes.
