Skip to content

Latest commit

 

History

History
203 lines (149 loc) · 8.43 KB

troubleshooting-dspa-component-errors.adoc

File metadata and controls

203 lines (149 loc) · 8.43 KB

Troubleshooting DSPA component errors

This table displays common errors found in DataSciencePipelinesApplication (DSPA) components, along with their statuses, messages, and proposed solutions. The Ready condition type accumulates errors from various DSPA components, providing a status view of the DSPA deployment.

Type Status Error message Solution

ObjectStorageAvailable

Ready

False

False

Could not connect to Object Store: tls: failed to verify certificate: x509: certificate signed by unknown authority

This issue occurs in clusters that use self-signed certificates with {productname-short} version 2.9 or later. The data science pipelines manager cannot connect to the object storage as it does not trust the object storage SSL certificate. Therefore, the pipeline server cannot be created. Contact your IT operations administrator to add the relevant Certificate Authority bundle.

For more information, see Working with certificates.

ObjectStorageAvailable

Ready

False

False

Could not connect to Object Store Deployment for component "ds-pipeline-pipelines-definition" is missing - prerequisite component might not yet be available. Deployment for component "ds-pipeline-persistenceagent-pipelines-definition" is missing - prerequisite component might not yet be available. Deployment for component "ds-pipeline-scheduledworkflow-pipelines-definition" is missing - prerequisite component might not yet be available.

In clusters running {productname-short} 2.8.x, the data science pipelines manager might fail to connect to the object storage, and the pipeline server might not be created.

Ensure that your object store credentials and connection information are accurate, and verify that the object store is accessible from within the data science project’s associated OpenShift namespace. One common issue is that the object storage SSL certificate is not trusted, particularly if self-signed certificates are used.

Verify and update your object storage credentials, then retry the operation.

ObjectStorageAvailable

Ready

False

False

Wrong credentials for Object Storage: Could not connect to (minio-my-project.apps.my-cluster.com), Error: The request signature we calculated does not match the signature you provided. Check your key and signing method.

Provide the correct credentials for your object storage and retry the operation.

DatabaseAvailable

Ready

False

False

FailingToDeploy: Dial tcp XXX.XX.XXX.XXX:3306 : i/o timeout

If the issue persists beyond startup, check for network issues or misconfigurations in the database connection settings.

DatabaseAvailable

Ready

False

False

Unable to connect to external database: tls: failed to verify certificate: x509: certificate signed by unknown authority

This issue can occur when you use any external database, such as Amazon RDS. The data science pipelines manager cannot connect to the database because it does not trust the database SSL certificate, preventing the creation of the pipeline server. Contact your IT operations administrator to add the relevant certificates.

For more information, see Working with certificates.

DatabaseAvailable

Ready

False

False

Error 1129: Host 'A.B.C.D' is blocked because of many connection errors.

This issue might occur when using an external database, such as Amazon RDS. Initially, the pipeline server is created successfully. However, after some time, the {productname-short} dashboard displays an "Error displaying pipelines" message, and the DSPA conditions indicate that the host is blocked due to multiple connection errors.

For more information on how resolve this issue for an external Amazon RDS database, see Resolving "Host is blocked because of many connection errors" error in Amazon RDS for MySQL.

APIServerReady

Ready

False

False

Route creation failed due to lengthy project name: Route.route.openshift.io is invalid: spec.host exceeds 63 characters.

Ensure that the project name in OpenShift is less than 40 characters.

APIServerReady

Ready

False

False

FailingToDeploy: Component replica failed to create. Message: serviceaccount "ds-pipeline-sample" not found.

If the failure persists for more than 25 seconds during DSPA startup, recreate the missing service account.

PersistenceAgentReady

Ready

False

False

FailingToDeploy: Component’s replica failed to create. Message: serviceaccount "ds-pipeline-persistenceagent-sample" not found.

If the failure persists for more than 25 seconds during DSPA startup, recreate the missing service account.

ScheduledWorkflowReady

Ready

False

False

FailingToDeploy: Component’s replica failed to create. Message: serviceaccount "ds-pipeline-scheduledworkflow-sample" not found.

If the failure persists for more than 25 seconds during DSPA startup, recreate the missing service account

MLMDProxyReady

Ready

False

False

Deploying: Component [ds-pipeline-scheduledworkflow-sample] is still deploying.

Wait for DSPA startup to complete. If deployment fails after 25 seconds, check the logs for further information.

Common errors across DSP components

The following table lists errors that might occur across multiple DSPA components:

Deployment condition Condition type Status Error message Solution

Component Deployment Not Found

ComponentDeploymentNotFound

False

Deployment for component <component> is missing - prerequisite component might not yet be available.

The deployment for the component does not exist. Typically, this is due to missing deployments or issues that occurred during creation.

Deployment Scaled Down

MinimumReplicasAvailable

False

Deployment for component <component> is scaled down.

The component is unavailable as the deployment replica count is set to zero.

Component Failing to Progress

FailingToDeploy

False

Component <component> has failed to progress. Reason: <progressingCond.Reason>. Message: <progressingCond.Message>

The deployment has stalled due to ProgressDeadlineExceeded or ReplicaSetCreateError issues, or similar.

Replica Creation Failure

FailingToDeploy

False

Component’s replica <component> has failed to create. Reason: <replicaFailureCond.Reason>. Message: <replicaFailureCond.Message>

Replica creation has failed, typically due to an error in the replica set or with the service accounts.

Pod-Level Failures

FailingToDeploy

False

Concatenated failure messages for each pod.

Deployment pods are in a failed state. Check the pod logs for further information.

Pod in CrashLoopBackOff

FailingToDeploy

False

Component <component> is in CrashLoopBackOff. Message from pod: <crashLoopBackOffMessage>

Pod containers are failing repeatedly, often due to incorrect environment variables or missing service accounts.

Component Deploying (No Errors)

Deploying

False

Component <component> is deploying.

The component deployment process is ongoing with no errors detected.

Component Minimally Available

MinimumReplicasAvailable

True

Component <component> is minimally available.

The component is available, but only with the minimum number of replicas running.