From 91041ec104480af45f60bfa78ff6aa0f832babb8 Mon Sep 17 00:00:00 2001
From: VaniHaripriya
Date: Mon, 20 Jan 2025 17:09:01 -0600
Subject: [PATCH] ILAB Phase 3 Github Markdown Documentation

Signed-off-by: VaniHaripriya
---
 run-pipeline-on-rhoai-.md | 730 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 730 insertions(+)
 create mode 100644 run-pipeline-on-rhoai-.md

diff --git a/run-pipeline-on-rhoai-.md b/run-pipeline-on-rhoai-.md
new file mode 100644
index 0000000..a200ac1
--- /dev/null
+++ b/run-pipeline-on-rhoai-.md
@@ -0,0 +1,730 @@
# Running InstructLab Pipeline with Data Science Pipelines on RHOAI

This document provides step-by-step instructions for setting up and using Data Science Pipelines (DSP) for InstructLab iterations.

## Prerequisites

* An OpenShift cluster with
  * Sufficient GPUs available for training.
    * At a minimum, a node with at least 4 GPUs, such as NVIDIA A100s.
  * The following Operators already installed:
    * Red Hat - Authorino
    * Red Hat OpenShift Serverless
    * Red Hat OpenShift Service Mesh v2
      * NOTE: v3 is not compatible with RHOAI
    * Red Hat OpenShift AI and the operator dependencies documented at [OpenShift AI Supported Configurations](https://access.redhat.com/articles/rhoai-supported-configs)
* Teacher and Judge models with a serving endpoint
  * If these are already set up, you will need the endpoint, API key, and any required CA bundles for each model
  * If setting up your own using these instructions, you will need additional multi-node A100s or L40S GPUs for each model
* An SDG taxonomy tree to use for Synthetic Data Generation (SDG)
  * See the [taxonomy tree] documentation for instructions on setting up your own taxonomy tree.
* An OpenShift AI installation, with the Training Operator and KServe components set to `Managed`
  * A data science project/namespace; in this document this will be referred to as `<data-science-project-name/namespace>`
* A [StorageClass] that supports dynamic provisioning with [ReadWriteMany] access mode (see step 3 below).
* An S3 object store such as AWS S3, or an S3-compatible alternative such as Ceph, NooBaa, or MinIO.
* A locally installed `oc` command-line tool to create and manage Kubernetes resources.
* The `ilab` CLI (or Skopeo, ORAS, etc.) for model downloads
* For Disconnected Clusters:
  * **Mirror Required Images**:

    In a disconnected environment, you must mirror the following container images to your internal registry before running the pipeline.
    Use tools like `oc adm release mirror`, `skopeo`, or `oras` to mirror these images:
    - `registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770`
    - `registry.redhat.io/openshift4/ose-cli@sha256:1d5c8442a6ec745e6ae44a7738c0681f1e21aac8be76ba826c2ddf2eed8475db`
    - `registry.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:b3dc9af0244aa6b84e6c3ef53e714a316daaefaae67e28de397cd71ee4b2ac7e`
    - `registry.redhat.io/rhelai1/skills-adapter-v3@sha256:53dd11a762bb39fc33c15499891309f0cdc8dbfd02abf94c9c60aad643aca255`
    - `registry.redhat.io/rhelai1/knowledge-adapter-v3@sha256:ef1608ec78d5e39655b505544c0f30a015a6c9cb7e2b2deffe394791f8c76c6f`
    - `registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1@sha256:bc08e466aa35352a621d0ad221c2e247ff9751f4cb6cffe00d5894ce6bfd3fd7`
    - `registry.redhat.io/rhelai1/prometheus-8x7b-v2-0@sha256:9fcb49c230f6e73ff944377307bb83a05ae3ac20300af75e429151f4f8bf4285`
    - `quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73`
    - `quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8`
    - `quay.io/modh/vllm@sha256:97b91f9bd71202f5de8d379cfb61baec887b47f836a2ff8b158c946196de5660`
    - `quay.io/opendatahub/workbench-images@sha256:7f26f5f2bec4184af15acd95f29b3450526c5c28c386b6cb694fbe82d71d0b41`
    - `ghcr.io/oras-project/oras:main@sha256:8859e7e3ae510fb921ebeb109ac9d3e3bb91799e0d52001ae456df33929029db`

  * **500GB PersistentVolumeClaim (PVC) for Mixtral**:

    The proposed method to deploy Mixtral requires a **500GB PVC**.
    - In a **disconnected cluster**, ensure that your OpenShift environment has sufficient storage capacity and a **StorageClass** configured to provision this PVC.
    - If automatic PVC creation fails, you may need to manually create a PersistentVolume (PV) and bind it to a PVC.
  * **Additional Git Repository for Mirroring**:

    - The iLab pipeline uses a taxonomy repo from GitHub.
    - The following repository should be mirrored for use in a disconnected cluster:

      ```text
      Original repo: github.com/sallyom/taxonomy.git
      repo_branch: utah-hockey
      ```

[StorageClass]: https://kubernetes.io/docs/concepts/storage/storage-classes/
[taxonomy tree]: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.2/html/creating_a_custom_llm_using_rhel_ai/customize_taxonomy_tree
[ReadWriteMany]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes

## Steps

Before running the training and evaluation steps, we must complete the following:

1. [Prepare base model and push to object store](#prepare-base-model-and-push-to-object-store)
1. [Setting up Judge & Teacher model](#setting-up-judge--teacher-model)
   * [Deploy a judge model server](#deploy-a-judge-model-server-optional) (Optional)
   * [Deploy judge model serving details](#deploy-judge-model-serving-details)
   * [Deploy a teacher model server](#deploy-a-teacher-model-server-optional) (Optional)
   * [Deploy teacher model serving details](#deploy-teacher-model-serving-details)
1. [Setup NFS StorageClass](#optional---setup-nfs-storageclass) (Optional)
1. [Set Up Data Science Pipelines Server and Run InstructLab Pipeline](#set-up-data-science-pipelines-server-and-run-instructlab-pipeline)

### Prepare base model and push to object store

You will need a base model for the InstructLab pipeline to train, so to begin, upload the [granite-7b-starter] model to your object store.

Create a local staging directory:

```bash
$ mkdir -p s3-data/
```

Download the model with `ilab` and copy it into the `s3-data` directory:
```bash
# You can also use the ORAS or Skopeo CLI tools to download the model
# If using tools other than ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <model-download-path>/rhelai1/granite-7b-starter s3-data/granite-7b-starter
```

Generate a tar archive:
```bash
$ cd s3-data
$ tar -czvf rhelai.tar.gz *
```

Upload the created tar archive to your object store.
```bash
# The default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors files are found in
# s3://<your-bucket-name>/granite-7b-starter/*.safetensors
s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
```

[granite-7b-starter]: https://catalog.redhat.com/software/containers/rhelai1/granite-7b-starter/667ebf10abaa082bcf96ea6a

### Setting up Judge & Teacher model

The Teacher model is used for Synthetic Data Generation (SDG), while the Judge model is used for model evaluation.

For the Teacher model you need [mixtral-8x7b-instruct-v0-1] deployed with the [skills-adapter-v3:1.2] and
[knowledge-adapter-v3:1.2] LoRA layered skills and knowledge adapters.

For the Judge model you will need the [prometheus-8x7b-v2-0 model].

If you already have these models deployed, you can skip the deployment steps and go straight to the secret setup for
the Judge and Teacher models respectively.

[mixtral-8x7b-instruct-v0-1]: https://catalog.redhat.com/software/containers/rhelai1/mixtral-8x7b-instruct-v0-1/6682619da4ea27a10a36e4c6
[skills-adapter-v3:1.2]: https://catalog.redhat.com/software/containers/rhelai1/skills-adapter-v3/66a89d60ac8385376a3dabed
[knowledge-adapter-v3:1.2]: https://catalog.redhat.com/software/containers/rhelai1/knowledge-adapter-v3/66a89ce222c498e45a8e0042
[prometheus-8x7b-v2-0 model]: https://catalog.redhat.com/software/containers/rhelai1/prometheus-8x7b-v2-0/6682611224ba6617d472f451

#### Deploy a judge model server (optional)

Create a service account to be used for token authentication:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: judge-sa
  namespace: <data-science-project-name/namespace>
```

Upload the [prometheus-8x7b-v2-0 model] (the Judge model) to the same object storage as before.

For example, using `ilab` to download and `s3cmd` to sync to the object store:
```bash
# You can also use the ORAS or Skopeo CLI tools to download the model
# If using tools other than ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.2

# The default cache location for ilab model download is ~/.cache/instructlab/models
s3cmd sync path/to/model s3://your-bucket-name/judge-model/
```

Navigate to the OpenShift AI dashboard
* Choose **Data Science Projects** from the left-hand menu and choose your data science project/namespace.
* Select the **Connections** tab, and then click the **Add connection** button. Enter the details of your S3 bucket (object store) and click **Add data connection**.

> [!NOTE]
> Before following the next step, ensure that the `CapabilityServiceMeshAuthorization` condition is `True` in the `DSCInitialization` resource.
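You can check this condition from the CLI. The following is a sketch; it assumes the default `DSCInitialization` instance is named `default-dsci`, which may differ in your installation:

```bash
# Print the status of the CapabilityServiceMeshAuthorization condition;
# expect this to output "True" before continuing
oc get dscinitialization default-dsci \
  -o jsonpath='{.status.conditions[?(@.type=="CapabilityServiceMeshAuthorization")].status}'
```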

Create a model server instance
* Navigate back to the **Data Science Projects** page, select your namespace again, and then select the **Models** tab.
* On the right-hand side, select **Deploy model** under the single-model serving platform.
* Under **Serving runtime**, choose the `vLLM Serving Runtime for KServe` serving runtime.
* Check the `Make deployed models available through an external route` box.
* Under token authentication, check the `Require token authentication` box and enter the name of the service account created above (`judge-sa`).
* Choose the existing data connection created earlier.
* Click **Deploy**.

[prometheus-8x7b-v2-0 model]: https://catalog.redhat.com/software/containers/rhelai1/prometheus-8x7b-v2-0/6682611224ba6617d472f451

#### Deploy judge model serving details

Create a secret containing the judge model serving details:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <judge-serving-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>                      # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint>              # Model serving endpoint, sample format - `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>                # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-ca-cert-configmap-name>       # ConfigMap containing the CA cert for the judge model (optional - required if using a custom CA cert), example - `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-ca-cert-configmap-key> # Name of the key inside the ConfigMap (optional - required if using a custom CA cert), example - `ca.crt`
```

> [!NOTE]
> If using a custom CA certificate you must provide the relevant data in a ConfigMap. The ConfigMap name and key
> are then provided as parameters to the pipeline as well as in the `judge-serving-details` secret above.

If you deployed the Judge model server using the optional instructions above, you can retrieve the `JUDGE_API_KEY` by
running the following command:

```bash
JUDGE_API_KEY=$(oc -n <data-science-project-name/namespace> create token judge-sa)
```

#### Deploy a teacher model server (Optional)

Unlike the Judge model, the Teacher model must be deployed manually on RHOAI; this consists of deploying the K8s resources
using `oc`.

First, upload the Teacher model to S3 if it does not already exist there:

```bash
# You can also use the ORAS or Skopeo CLI tools to download the model
# If using tools other than ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.2

# The default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors files are found in
# s3://your-bucket-name/teach-model/*.safetensors
s3cmd sync path/to/model s3://your-bucket-name/teach-model/
```

Deploy the following YAML, called `pre_requisites.yaml`, to the `<data-science-project-name/namespace>` namespace:

<details>
<summary>pre_requisites.yaml</summary>

```yaml
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: mixtral-sa
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view-role
  labels:
    opendatahub.io/dashboard: 'true'
rules:
  - verbs:
      - get
    apiGroups:
      - serving.kserve.io
    resources:
      - inferenceservices
    resourceNames:
      - mixtral
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view
  labels:
    opendatahub.io/dashboard: 'true'
subjects:
  - kind: ServiceAccount
    name: mixtral-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mixtral-view-role
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mixtral-serving-ilab
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  # Update to a StorageClass available in your cluster
  storageClassName: standard-csi
  volumeMode: Filesystem
```

</details>

```bash
oc -n <data-science-project-name/namespace> apply -f pre_requisites.yaml
```
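After applying, you can sanity-check that the resources were created. This is a sketch; depending on your StorageClass's volume binding mode, the PVC may remain `Pending` until the serving pod is scheduled:

```bash
# Confirm the service account and PVC from pre_requisites.yaml exist
oc -n <data-science-project-name/namespace> get serviceaccount mixtral-sa
oc -n <data-science-project-name/namespace> get pvc mixtral-serving-ilab
```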

You will need to ensure that the `storage-config` secret exists in the `<data-science-project-name/namespace>` namespace,
and that it contains the configuration for the bucket where the teacher model is stored.

```yaml
apiVersion: v1
stringData:
  aws-connection-my-bucket: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: storage-config
type: Opaque
```
If this secret does not exist in this namespace, create it. If it does exist, ensure there is an entry
for the bucket that stores the teacher model. The `key` is used in the `InferenceService` spec below.

Next we need to create the custom `ServingRuntime` and `InferenceService`.

Similar to above, deploy the following YAML files to the `<data-science-project-name/namespace>` namespace.

You will need to update the `spec.model.storage.path` in the `InferenceService` to match the path where the model files are
stored in your bucket. The `key` should match the entry in your `storage-config` secret that has the bucket credentials.
In our example above we use `aws-connection-my-bucket`.
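To confirm the entry is present and well-formed, you can decode it straight from the secret. This is a sketch using the example key name above:

```bash
# Decode the bucket entry from the storage-config secret and inspect it
oc -n <data-science-project-name/namespace> get secret storage-config \
  -o jsonpath='{.data.aws-connection-my-bucket}' | base64 -d
```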

<details>
<summary>servingruntime.mixtral.yaml</summary>

```yaml
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: Mixtral ServingRuntime
    opendatahub.io/template-name: vllm-runtime
    openshift.io/display-name: mixtral
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        - --distributed-executor-backend=mp
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
        - mountPath: /mnt
          name: mixtral-serve
  multiModel: false
  storageHelper:
    disabled: true
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  volumes:
    - name: mixtral-serve
      persistentVolumeClaim:
        claimName: mixtral-serving-ilab
    - emptyDir:
        medium: Memory
        sizeLimit: 2Gi
      name: shm
```

</details>

<details>
<summary>inferenceservice.mixtral.yaml</summary>

```yaml
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mixtral
    security.opendatahub.io/enable-auth: "true"
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  finalizers:
    - inferenceservice.finalizers
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
        - --dtype=bfloat16
        - --tensor-parallel-size=4
        - --enable-lora
        - --max-lora-rank=64
        - --lora-dtype=bfloat16
        - --fully-sharded-loras
        - --lora-modules
        - skill-classifier-v3-clm=/mnt/models/skills
        - text-classifier-knowledge-v3-clm=/mnt/models/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
      storage:
        # the key in the storage-config secret deployed earlier
        key: aws-connection-my-bucket
        # update this to match the path of the model files in your bucket
        path: <path-to-model>
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

</details>

```bash
oc -n <data-science-project-name/namespace> apply -f servingruntime.mixtral.yaml
oc -n <data-science-project-name/namespace> apply -f inferenceservice.mixtral.yaml
```

A new pod named `mixtral-predictor-0000#-deployment-<pod-id>` should be created and reach a running state. If the pod
does not come up successfully, inspect the `.status` field of the `InferenceService` for issues:

```bash
oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o yaml
```

#### Deploy teacher model serving details

Create a secret containing the Teacher model serving details:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <teacher-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key: <teacher-model-api-key>                  # Deployed model-server auth token
  endpoint: <teacher-model-endpoint>                # Model serving endpoint, sample format - `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>                       # Name of the teacher model or deployment
  SDG_CA_CERT: <teacher-ca-cert-configmap-name>     # ConfigMap containing the CA cert for the teacher model (optional - required if using a custom CA cert), example - `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-ca-cert-configmap-key> # Name of the key inside the ConfigMap (optional - required if using a custom CA cert), example - `ca.crt`
```

> [!NOTE]
> If using a custom CA certificate you must provide the relevant data in a ConfigMap. The ConfigMap name and
> key are then provided as parameters to the pipeline as well as in the `teacher-model-details-k8s-secret` secret above.

If you deployed the Teacher model server using the optional instructions above, you can retrieve the `api_key` by
running the following command:

```bash
SDG_API_KEY=$(oc -n <data-science-project-name/namespace> create token mixtral-sa)
```

### (Optional) - Setup NFS StorageClass

> [!CAUTION]
> The image provided here is for test purposes only.
> Users must provide a production-ready StorageClass with ReadWriteMany capability.

This step is needed when the cluster doesn't have a storage provisioner capable of provisioning PersistentVolumeClaims with ReadWriteMany capability.

Install the NFS CSI driver:
```bash
$ curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --
```

Deploy an in-cluster NFS server:

```bash
oc new-project nfs
oc apply -f ./nfs-server-deployment.yaml
```

> [!NOTE]
> Check the root PersistentVolumeClaim that will be created and the requested storage size.

Create the NFS StorageClass:
```bash
oc apply -f ./nfs-storage-class.yaml
```
#### Accelerator Profile:
An accelerator profile must also be defined within the RHOAI dashboard or via the CLI to enable GPU acceleration for model serving with KServe:

```yaml
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: gpu
  namespace: redhat-ods-applications
spec:
  displayName: gpu
  enabled: true
  identifier: nvidia.com/gpu
  tolerations: []
```
#### Signed Certificate:
A signed certificate avoids unnecessary TLS trust issues when running the training pipeline.

To deploy a signed certificate in your cluster, follow the [trusted cluster cert](signed-certificate/README.md) documentation.

This will create the required resources in the cluster, including the required StorageClass.

### Set Up Data Science Pipelines Server and Run InstructLab Pipeline

Now we can continue to set up the required resources in our cluster.

The following resources will be created:

1. Secret
1. ClusterRole
1. ClusterRoleBinding
1. Pod

Create a secret resource that contains the credentials for your object storage (AWS S3 bucket):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>      # The object store bucket containing SDG+Model+Taxonomy data (name of the S3 bucket)
  access_key: <s3-access-key>   # The object store access key (AWS Access Key ID)
  secret_key: <s3-secret-key>   # The object store secret key (AWS Secret Access Key)
  data_key: <sdg-tarball-name>  # The name of the tarball that contains SDG data (e.g. the `rhelai.tar.gz` created earlier)
  endpoint: <s3-endpoint>       # The object store endpoint
  region: <s3-region>           # The region for the object store
  verify_tls: "true"            # Verify TLS for the object store
```

Apply the YAML file to the cluster.

Create a ServiceAccount, ClusterRole, and ClusterRoleBinding to provide the service account running the pipeline with
access to the related resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-access-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "persistentvolumeclaims", "secrets", "events"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "watch"]

  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["get", "list", "create", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-access-binding
subjects:
  - kind: ServiceAccount
    name: <service-account-name> # created above in Step 2
    namespace: <data-science-project-name/namespace>
roleRef:
  kind: ClusterRole
  name: secret-access-role
  apiGroup: rbac.authorization.k8s.io
```
Apply the YAML to the cluster.

This is the required [RBAC configuration] that we are applying to the ServiceAccount.

#### Configure Data Science Pipeline Server:

From within the RHOAI dashboard, navigate to the **Data Science Pipelines** page and click **Configure pipeline server**. This will present you with a form where you can upload the credentials for the S3 bucket you created in the previous step.
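Once the form is submitted, the pipeline server is provisioned in your namespace. A rough way to confirm this from the CLI (a sketch; exact resource and pod names vary by RHOAI version, but the pipeline server pods are typically prefixed with `ds-pipeline`):

```bash
# The pipeline server is backed by a DataSciencePipelinesApplication resource
oc -n <data-science-project-name/namespace> get datasciencepipelinesapplications
oc -n <data-science-project-name/namespace> get pods | grep ds-pipeline
```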


### Run the Pipeline

Now that all the cluster requirements have been set up, we are ready to upload and run our InstructLab pipeline!

#### Upload the Pipeline:

Go back to the RHOAI Data Science Pipelines dashboard and select **Import pipeline**. We recommend importing the pipeline YAML directly from the GitHub repo using: `https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml`
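If you need the pipeline definition as a local file (for example, for the disconnected-cluster case described in the note below), download it first and import the file instead:

```bash
# Run this from a machine with internet access, then transfer the file
# to the workstation you import from
curl -LO https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml
```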


**Note**: When using a **Disconnected Cluster**, the pipeline should be imported from a file instead of a URL. Importing from a URL fails in disconnected clusters because they cannot connect to GitHub.

#### Create a Run:
Once the pipeline is uploaded, we can select **Create run** from the **Actions** dropdown. This presents a number of parameters we can set to customize our run. Click **Create run** at the bottom of the page to kick off your InstructLab pipeline.
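While the run executes, you can watch the workloads it creates from the CLI. This is a sketch; the individual pipeline steps run as pods, and the training phases create PyTorchJob resources (the same resources granted in the RBAC configuration earlier):

```bash
# Training phases appear as PyTorchJobs; pipeline steps run as pods
oc -n <data-science-project-name/namespace> get pytorchjobs
oc -n <data-science-project-name/namespace> get pods -w
```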


#### Available Pipeline Parameters:

| Parameter | Definition |
|---------- | ---------- |
|`sdg_repo_url` | SDG parameter. Points to a taxonomy git repository|
|`sdg_repo_branch` | SDG parameter. Points to a branch within the taxonomy git repository. If set, has priority over `sdg_repo_pr`|
|`sdg_repo_pr` |SDG parameter. Points to a pull request against the taxonomy git repository|
|`sdg_base_model` |SDG parameter. LLM model used to generate the synthetic dataset|
|`sdg_scale_factor` |SDG parameter. The total number of instructions to be generated|
|`sdg_pipeline` |SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b.|
|`sdg_max_batch_len` |SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step.|
|`sdg_sample_size` | SDG parameter. Sampling size used for Synthetic Data Generation |
|`train_nproc_per_node` |Training parameter. Number of GPUs per node/worker to use for training.|
|`train_nnodes` |Training parameter. Number of nodes/workers to train on.|
|`train_num_epochs_phase_1` |Training parameter for Phase 1. Number of epochs to run training.|
|`train_num_epochs_phase_2` |Training parameter for Phase 2. Number of epochs to run training.|
|`train_effective_batch_size_phase_1` |Training parameter for Phase 1. The number of samples in a batch that the model should see before its parameters are updated.|
|`train_effective_batch_size_phase_2` |Training parameter for Phase 2. The number of samples in a batch that the model should see before its parameters are updated.|
|`train_learning_rate_phase_1` |Training parameter for Phase 1. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.|
|`train_learning_rate_phase_2` |Training parameter for Phase 2. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.|
|`train_num_warmup_steps_phase_1` |Training parameter for Phase 1. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.|
|`train_num_warmup_steps_phase_2` |Training parameter for Phase 2. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.|
|`train_save_samples` |Training parameter. Number of samples the model should see before saving a checkpoint.|
|`train_max_batch_len` |Training parameter. Maximum tokens per GPU for each batch that will be handled in a single step.|
|`train_seed` |Training parameter. Random seed for initializing training.|
|`mt_bench_max_workers` |MT Bench parameter. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.|
|`mt_bench_merge_system_user_message` |MT Bench parameter. Boolean indicating whether to merge system and user messages (required for Mistral based judges)|
|`final_eval_max_workers` |Final model evaluation parameter for MT Bench Branch. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.|
|`final_eval_few_shots` |Final model evaluation parameter for MMLU. Number of question-answer pairs provided in the context preceding the question used for evaluation.|
|`final_eval_batch_size` |Final model evaluation parameter for MMLU. Batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory.|
|`final_eval_merge_system_user_message` |Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral based judges)|
|`k8s_storage_class_name` |A Kubernetes StorageClass name for persistent volumes. The selected StorageClass must support RWX PersistentVolumes.|

##### Suggested Parameters: Full Pipeline

To run the InstructLab pipeline at full capability, we suggest using these values:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://&lt;your-bucket-name&gt;/granite-7b-starter |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "full" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | 1.0 |
|`train_nproc_per_node` | 2 |
|`train_nnodes` | 2 |
|`train_num_epochs_phase_1` | 7 |
|`train_num_epochs_phase_2` | 10 |
|`train_effective_batch_size_phase_1` | 128 |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | 2e-05 |
|`train_learning_rate_phase_2` | 6e-06 |
|`train_num_warmup_steps_phase_1` | 1000 |
|`train_num_warmup_steps_phase_2` | 1000 |
|`train_save_samples` | 250000 |
|`train_max_batch_len` | 5000 |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Note that this will take a very long time, on the scale of double-digit hours of runtime.

##### Suggested Parameters: Development

Running the InstructLab pipeline at full capability takes a very long time and consumes a significant amount of resources.
To create an e2e run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest using these values instead:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://&lt;your-bucket-name&gt;/granite-7b-starter |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "simple" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | **0.0002** |
|`train_nproc_per_node` | **1** |
|`train_nnodes` | **1** |
|`train_num_epochs_phase_1` | **2** |
|`train_num_epochs_phase_2` | **2** |
|`train_effective_batch_size_phase_1` | **3840** |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | **.0001** |
|`train_learning_rate_phase_2` | **.0001** |
|`train_num_warmup_steps_phase_1` | **800** |
|`train_num_warmup_steps_phase_2` | **800** |
|`train_save_samples` | **0** |
|`train_max_batch_len` | **20000** |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Using these parameters allows a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes.
Additionally, we can point the `judge-server` and `teacher-server` to the same Mistral model, which only uses 1 GPU, and the PyTorchJob configuration
specified here also only uses 2 training nodes of 1 GPU, so a total of 3 GPUs are required, rather than the 8-9 GPUs required for the full pipeline.
With that said, the output model quality is likely very poor, and these values should only be used for testing purposes.

[RBAC configuration]: https://github.com/opendatahub-io/ilab-on-ocp/tree/main/standalone#rbac-requirements-when-running-in-a-kubernetes-job