Before running the training and evaluation steps, we must complete the following:

1. [Prepare base model and push to object store](#prepare-base-model-and-push-to-object-store)
1. [Setting up Judge & Teacher model](#setting-up-judge--teacher-model)
* [Deploy a judge model server](#deploy-a-judge-model-server-optional) (Optional)
* [Deploy judge model serving details](#deploy-judge-model-serving-details)
1. [Setup NFS StorageClass](#optional---setup-nfs-storageclass) (Optional)
1. [Set Up Data Science Pipelines Server and Run InstructLab Pipeline](#set-up-data-science-pipelines-server-and-run-instructlab-pipeline)

### Prepare base model and push to object store

You will need a base model to train the ilab pipeline on, so to begin, upload the [granite-7b-starter] model to your object store.

```bash
$ mkdir -p s3-data/
```

Download the ilab model repository and copy it into the `s3-data` directory:

```bash
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <path-to-model-downloaded-dir>/rhelai1/granite-7b-starter s3-data/granite-7b-starter
```
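
Before archiving and uploading, it can help to sanity-check that the expected model files landed in place. This is a minimal check, assuming the `s3-data/granite-7b-starter` layout created above; the filenames are the ones the model ships with:

```bash
# Each of these should exist; a missing file points to a bad copy step
$ ls s3-data/granite-7b-starter/config.json \
     s3-data/granite-7b-starter/tokenizer.json \
     s3-data/granite-7b-starter/tokenizer_config.json \
     s3-data/granite-7b-starter/*.safetensors
```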

Generate a tar archive:
```bash
$ cd s3-data
$ tar -czvf rhelai.tar.gz *
```

Upload the created tar archive and the model directory to your object store.

```bash
# Default cache location for `ilab model download` is ~/.cache/instructlab/models
# Copy the model so that the *.safetensors files are found at
# s3://<your-bucket-name>/granite-7b-starter/*.safetensors
$ s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
$ s3cmd put s3-data/rhelai.tar.gz s3://<your-bucket-name>/rhelai.tar.gz
```
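
If s3cmd is not available, an equivalent upload with the AWS CLI would look like the following sketch (assuming your credentials and endpoint are already configured for the `aws` command):

```bash
# Hypothetical equivalent of the s3cmd commands above
$ aws s3 sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
$ aws s3 cp s3-data/rhelai.tar.gz s3://<your-bucket-name>/rhelai.tar.gz
```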

[granite-7b-starter]: https://catalog.redhat.com/software/containers/rhelai1/granite-7b-starter/667ebf10abaa082bcf96ea6a

Create a secret containing the judge model serving details (the secret `name` below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <judge-serving-details-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>        # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint>        # Model serving endpoint, sample format: `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>        # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-model-ca-cert-config-map-name>        # ConfigMap containing the CA cert for the judge model (optional; required if using a custom CA cert), e.g. `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-model-ca-cert-config-map-key>        # Name of the key inside the ConfigMap (optional; required if using a custom CA cert), e.g. `ca.crt`
```
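
A minimal way to create and verify the secret, assuming the manifest above is saved as `judge-serving-details.yaml` (the filename is illustrative):

```bash
# Apply the secret and confirm it exists in the project namespace
$ oc apply -f judge-serving-details.yaml
$ oc get secret <judge-serving-details-secret-name> -n <data-science-project-name/namespace>
```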

A data connection secret for the object store takes the following form (the key and metadata values below are placeholders):

```yaml
apiVersion: v1
stringData:
  <data-connection-key>: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: <data-connection-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
```

Create a secret containing the teacher model serving details (the secret `name` below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <teacher-serving-details-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key: <teacher-model-api-key>        # Deployed model-server auth token
  endpoint: <teacher-model-endpoint>        # Model serving endpoint, sample format: `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>        # Name of the teacher model or deployment
  SDG_CA_CERT: <teacher-model-ca-config-map-name>        # ConfigMap containing the CA cert for the teacher model (optional; required if using a custom CA cert), e.g. `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-model-ca-config-map-key>        # Name of the key inside the ConfigMap (optional; required if using a custom CA cert), e.g. `ca.crt`
```
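
As a quick sanity check, you can list the models the teacher server exposes. This is a sketch assuming an OpenAI-compatible server whose endpoint ends in `/v1`, as in the sample format above:

```bash
# Lists served models; add --cacert <path-to-ca.crt> if the server uses a custom CA
$ curl -H "Authorization: Bearer <teacher-model-api-key>" \
    <teacher-model-endpoint>/models
```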

Now we can continue to set up the required resources in our cluster.

The following resources will be created:

1. Secret
1. ClusterRole
1. ClusterRoleBinding
1. Pod

Create a secret resource that contains the credentials for your object storage (AWS S3 bucket):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>        # The object store bucket containing SDG+Model+Taxonomy data (name of the S3 bucket)
  access_key: <s3-access-key>        # The object store access key (AWS access key ID)
  secret_key: <s3-secret-key>        # The object store secret key (AWS secret access key)
  data_key: <s3-path-to-data-tarball>        # The name of the tarball that contains the SDG data
  endpoint: <s3-endpoint>        # The object store endpoint
  region: <s3-region>        # The region for the object store
  verify_tls: "true"        # Verify TLS for the object store
```

Apply the YAML file to the cluster:
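
A sketch of the apply step, assuming the manifest above is saved as `sdg-object-store-credentials.yaml`:

```bash
# Create the object store credentials secret in your project namespace
$ oc apply -f sdg-object-store-credentials.yaml -n <data-science-project-name/namespace>
```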

These are the required [RBAC configuration] settings, which we apply to the ServiceAccount.

From within the RHOAI dashboard, navigate to the **Data Science Pipelines** page and click **Configure pipeline server**. This will present you with a form where you can upload the credentials for the S3 bucket you created in the previous step.
<p align="center"><img src="assets/images/configure_pipeline_server.png" width=50%></p>
### Run the Pipeline

Once the pipeline is uploaded, we will be able to select **Create run**.

| Parameter | Description |
| --------- | ----------- |
|`sdg_scale_factor` |SDG parameter. The total number of instructions to be generated|
|`sdg_pipeline` |SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b.|
|`sdg_max_batch_len` |SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step.|
|`sdg_sample_size` |SDG parameter. Sampling size used for Synthetic Data Generation.|
|`train_nproc_per_node` |Training parameter. Number of GPUs per each node/worker to use for training.|
|`train_nnodes` |Training parameter. Number of nodes/workers to train on.|
|`train_num_epochs_phase_1` |Training parameter for Phase 1. Number of epochs to run training.|
|`final_eval_merge_system_user_message` |Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral-based judges)|
|`k8s_storage_class_name` |A Kubernetes StorageClass name for persistent volumes. Selected StorageClass must support RWX PersistentVolumes.|

##### Suggested Parameters: Full Pipeline

To run the ilab pipeline at full capability, we suggest the following values:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://<BUCKET>/<PATH_TO_MODEL> |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "full" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | 1.0 |
|`train_nproc_per_node` | 2 |
|`train_nnodes` | 2 |
|`train_num_epochs_phase_1` | 7 |
|`train_num_epochs_phase_2` | 10 |
|`train_effective_batch_size_phase_1` | 128 |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | 2e-05 |
|`train_learning_rate_phase_2` | 6e-06 |
|`train_num_warmup_steps_phase_1` | 1000 |
|`train_num_warmup_steps_phase_2` | 1000 |
|`train_save_samples` | 250000 |
|`train_max_batch_len` | 5000 |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Note that this will take a very long time, on the scale of double-digit hours of runtime.

##### Suggested Parameters: Development

Running the ilab pipeline at full capability takes a very long time and consumes a significant amount of resources. To create an end-to-end run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest the following values instead:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://<BUCKET>/<PATH_TO_MODEL> |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "simple" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | **0.0002** |
|`train_nproc_per_node` | **1** |
|`train_nnodes` | **1** |
|`train_num_epochs_phase_1` | **2** |
|`train_num_epochs_phase_2` | **2** |
|`train_effective_batch_size_phase_1` | **3840** |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | **1e-04** |
|`train_learning_rate_phase_2` | **1e-04** |
|`train_num_warmup_steps_phase_1` | **800** |
|`train_num_warmup_steps_phase_2` | **800** |
|`train_save_samples` | **0** |
|`train_max_batch_len` | **20000** |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Using these parameters will allow a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes.
Additionally, we can point the `judge-server` and `teacher-server` to the same Mistral model, which only uses 1 GPU, and the training configuration
specified here uses a single training node with 1 GPU, so a total of 2 GPUs is required, rather than the 8-9 GPUs required for the full pipeline.
With that said, the output model quality is likely very poor, and these values should only be used for testing purposes.

[RBAC configuration]: https://github.com/opendatahub-io/ilab-on-ocp/tree/main/standalone#rbac-requirements-when-running-in-a-kubernetes-job
