To use the distributed workloads feature in {productname-short}, you must install several components.
-
You have installed {productname-long}.
-
You have removed any previously installed instances of the CodeFlare Operator, as described in the Knowledgebase solution How to migrate from a separately installed CodeFlare Operator in your data science cluster.
-
If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}. See Enabling NVIDIA GPUs.
-
Click the Data Science Cluster tab.
-
Click the default instance name (for example, default-dsc) to open the instance details page.
-
Click the YAML tab to show the instance specifications.
-
Enable the required distributed workloads components. In the spec.components section, set the managementState field correctly for the required components:
-
If you want to use the CodeFlare framework to tune models, enable the codeflare, kueue, and ray components.
-
If you want to use the Kubeflow Training Operator to tune models, enable the kueue and trainingoperator components.
-
The list of required components depends on whether the distributed workload is run from a pipeline, from a notebook, or from both, as shown in the following table.
Table 1. Components required for distributed workloads

Component             Pipelines only  Notebooks only  Pipelines and notebooks
codeflare             Managed         Managed         Managed
dashboard             Managed         Managed         Managed
datasciencepipelines  Managed         Removed         Managed
kueue                 Managed         Managed         Managed
ray                   Managed         Managed         Managed
trainingoperator      Managed         Managed         Managed
workbenches           Removed         Managed         Managed
-
Click Save. After a short time, the components with a Managed state are ready.
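The component settings from Table 1 can also be worked out programmatically before you edit the YAML tab. The following Python code is a minimal sketch, not part of the product: the components_patch helper and the scenario names (pipelines, notebooks, both) are hypothetical names chosen for this example. It encodes Table 1 and prints the spec.components settings for one scenario as JSON, so that you can compare them with the values shown in the instance specifications.

```python
import json

# Scenario names are illustrative labels for the three columns of Table 1.
COLUMNS = ("pipelines", "notebooks", "both")

# Table 1 from this section: managementState per component, in column order.
TABLE = {
    "codeflare": ("Managed", "Managed", "Managed"),
    "dashboard": ("Managed", "Managed", "Managed"),
    "datasciencepipelines": ("Managed", "Removed", "Managed"),
    "kueue": ("Managed", "Managed", "Managed"),
    "ray": ("Managed", "Managed", "Managed"),
    "trainingoperator": ("Managed", "Managed", "Managed"),
    "workbenches": ("Removed", "Managed", "Managed"),
}


def components_patch(scenario):
    """Return the spec.components settings for a scenario (hypothetical helper)."""
    idx = COLUMNS.index(scenario)  # raises ValueError for an unknown scenario
    components = {
        name: {"managementState": states[idx]} for name, states in TABLE.items()
    }
    return {"spec": {"components": components}}


# Show the settings for a workload that is run from notebooks only.
print(json.dumps(components_patch("notebooks"), indent=2))
```

The resulting structure mirrors the layout of the spec.components section, so each component entry in the output corresponds to one managementState field in the YAML tab.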
Check the status of the codeflare-operator-manager, kuberay-operator, and kueue-controller-manager pods, as follows:
-
Click Workloads → Deployments.
-
Search for the codeflare-operator-manager, kuberay-operator, and kueue-controller-manager deployments. In each case, check the status as follows:
-
Click the deployment name to open the deployment details page.
-
Click the Pods tab.
-
Check the pod status.
When the status of the codeflare-operator-manager-<pod-id>, kuberay-operator-<pod-id>, and kueue-controller-manager-<pod-id> pods is Running, the pods are ready to use.
-
To see more information about each pod, click the pod name to open the pod details page, and then click the Logs tab.
-
Configure the distributed workloads feature as described in Managing distributed workloads.
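The pod status check in the verification steps can also be scripted. The following Python code is a minimal sketch under stated assumptions: it treats a pod as ready when its phase is Running, which approximates the status shown in the console, and it expects a pod list in the JSON format returned by oc get pods -o json. The pods_not_running and fetch_pods helpers are hypothetical names for this example, and the namespace passed to fetch_pods is an assumption that depends on where the components are deployed in your cluster. The sample data is illustrative, not real cluster output.

```python
import json
import subprocess

# Deployment name prefixes taken from the verification steps in this section.
PREFIXES = (
    "codeflare-operator-manager",
    "kuberay-operator",
    "kueue-controller-manager",
)


def pods_not_running(pod_list, prefixes=PREFIXES):
    """Return the prefixes that have no pod in the Running phase."""
    running = {
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Running"
    }
    return [
        prefix
        for prefix in prefixes
        if not any(name.startswith(prefix + "-") for name in running)
    ]


def fetch_pods(namespace):
    """Fetch the pod list with the oc CLI (requires a logged-in session)."""
    result = subprocess.run(
        ["oc", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Illustrative sample: one pod is Running, one is Pending, one is missing.
sample = {
    "items": [
        {"metadata": {"name": "kuberay-operator-abc12"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "kueue-controller-manager-xyz"}, "status": {"phase": "Pending"}},
    ]
}
print(pods_not_running(sample))
# prints ['codeflare-operator-manager', 'kueue-controller-manager']
```

Against a live cluster, you could call pods_not_running(fetch_pods(namespace)) with the namespace that contains the deployments. An empty list means that every expected pod reports the Running phase.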