Before running the training and evaluation steps, we must complete the following:

1. [Prepare base model and push to object store](#prepare-base-model-and-push-to-object-store)
1. [Setting up Judge & Teacher model](#setting-up-judge--teacher-model)
* [Deploy a judge model server](#deploy-a-judge-model-server-optional) (Optional)
* [Deploy judge model serving details](#deploy-judge-model-serving-details)
1. [Setup NFS StorageClass](#optional---setup-nfs-storageclass) (Optional)
1. [Set Up Data Science Pipelines Server and Run InstructLab Pipeline](#set-up-data-science-pipelines-server-and-run-instructlab-pipeline)

### Prepare base model and push to object store

You will need a base model to train the ilab pipeline on, so to begin, upload the [granite-7b-starter] model to your object store.

```bash
$ mkdir -p s3-data/
```

Download the ilab model repository and copy it into the `s3-data` directory:

```bash
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <path-to-model-downloaded-dir>/rhelai1/granite-7b-starter s3-data/granite-7b-starter
```
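
Before archiving and uploading, it can help to sanity-check that the expected model files landed in place. This is a minimal check, assuming the `s3-data/granite-7b-starter` layout created above; the filenames are the ones the model ships with:

```bash
# Each of these should exist; a missing file points to a bad copy step
$ ls s3-data/granite-7b-starter/config.json \
     s3-data/granite-7b-starter/tokenizer.json \
     s3-data/granite-7b-starter/tokenizer_config.json \
     s3-data/granite-7b-starter/*.safetensors
```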

Generate a tar archive:
```bash
$ cd s3-data
$ tar -czvf rhelai.tar.gz *
```

Upload the created tar archive and the model directory to your object store.

```bash
# Default cache location for `ilab model download` is ~/.cache/instructlab/models
# Copy the model so that the *.safetensors files are found at
# s3://<your-bucket-name>/granite-7b-starter/*.safetensors
$ s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
$ s3cmd put s3-data/rhelai.tar.gz s3://<your-bucket-name>/rhelai.tar.gz
```
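
If s3cmd is not available, an equivalent upload with the AWS CLI would look like the following sketch (assuming your credentials and endpoint are already configured for the `aws` command):

```bash
# Hypothetical equivalent of the s3cmd commands above
$ aws s3 sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
$ aws s3 cp s3-data/rhelai.tar.gz s3://<your-bucket-name>/rhelai.tar.gz
```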

[granite-7b-starter]: https://catalog.redhat.com/software/containers/rhelai1/granite-7b-starter/667ebf10abaa082bcf96ea6a

Create a secret containing the judge model serving details (the secret `name` below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <judge-serving-details-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>        # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint>        # Model serving endpoint, sample format: `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>        # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-model-ca-cert-config-map-name>        # ConfigMap containing the CA cert for the judge model (optional; required if using a custom CA cert), e.g. `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-model-ca-cert-config-map-key>        # Name of the key inside the ConfigMap (optional; required if using a custom CA cert), e.g. `ca.crt`
```
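
A minimal way to create and verify the secret, assuming the manifest above is saved as `judge-serving-details.yaml` (the filename is illustrative):

```bash
# Apply the secret and confirm it exists in the project namespace
$ oc apply -f judge-serving-details.yaml
$ oc get secret <judge-serving-details-secret-name> -n <data-science-project-name/namespace>
```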

A data connection secret for the object store takes the following form (the key and metadata values below are placeholders):

```yaml
apiVersion: v1
stringData:
  <data-connection-key>: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: <data-connection-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
```

Create a secret containing the teacher model serving details (the secret `name` below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <teacher-serving-details-secret-name>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key: <teacher-model-api-key>        # Deployed model-server auth token
  endpoint: <teacher-model-endpoint>        # Model serving endpoint, sample format: `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>        # Name of the teacher model or deployment
  SDG_CA_CERT: <teacher-model-ca-config-map-name>        # ConfigMap containing the CA cert for the teacher model (optional; required if using a custom CA cert), e.g. `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-model-ca-config-map-key>        # Name of the key inside the ConfigMap (optional; required if using a custom CA cert), e.g. `ca.crt`
```
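
As a quick sanity check, you can list the models the teacher server exposes. This is a sketch assuming an OpenAI-compatible server whose endpoint ends in `/v1`, as in the sample format above:

```bash
# Lists served models; add --cacert <path-to-ca.crt> if the server uses a custom CA
$ curl -H "Authorization: Bearer <teacher-model-api-key>" \
    <teacher-model-endpoint>/models
```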

Now we can continue to set up the required resources in our cluster.

The following resources will be created:

1. Secret
1. ClusterRole
1. ClusterRoleBinding
1. Pod

Create a secret resource that contains the credentials for your object storage (AWS S3 bucket):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>        # The object store bucket containing SDG+Model+Taxonomy data (name of the S3 bucket)
  access_key: <s3-access-key>        # The object store access key (AWS access key ID)
  secret_key: <s3-secret-key>        # The object store secret key (AWS secret access key)
  data_key: <s3-path-to-data-tarball>        # The name of the tarball that contains the SDG data
  endpoint: <s3-endpoint>        # The object store endpoint
  region: <s3-region>        # The region for the object store
  verify_tls: "true"        # Verify TLS for the object store
```

Apply the YAML file to the cluster:
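
A sketch of the apply step, assuming the manifest above is saved as `sdg-object-store-credentials.yaml`:

```bash
# Create the object store credentials secret in your project namespace
$ oc apply -f sdg-object-store-credentials.yaml -n <data-science-project-name/namespace>
```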

These are the required [RBAC configuration] settings, which we apply to the ServiceAccount.

From within the RHOAI dashboard, navigate to the **Data Science Pipelines** page and click **Configure pipeline server**. This will present you with a form where you can upload the credentials for the S3 bucket you created in the previous step.
<p align="center"><img src="assets/images/configure_pipeline_server.png" width=50%></p>
### Run the Pipeline

Once the pipeline is uploaded, we will be able to select **Create run**.

| Parameter | Description |
| --------- | ----------- |
|`sdg_scale_factor` |SDG parameter. The total number of instructions to be generated|
|`sdg_pipeline` |SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b.|
|`sdg_max_batch_len` |SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step.|
|`sdg_sample_size` |SDG parameter. Sampling size used for Synthetic Data Generation.|
|`train_nproc_per_node` |Training parameter. Number of GPUs per each node/worker to use for training.|
|`train_nnodes` |Training parameter. Number of nodes/workers to train on.|
|`train_num_epochs_phase_1` |Training parameter for Phase 1. Number of epochs to run training.|
|`final_eval_merge_system_user_message` |Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral-based judges)|
|`k8s_storage_class_name` |A Kubernetes StorageClass name for persistent volumes. Selected StorageClass must support RWX PersistentVolumes.|

##### Suggested Parameters: Full Pipeline

To run the ilab pipeline at full capability, we suggest the following values:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://<BUCKET>/<PATH_TO_MODEL> |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "full" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | 1.0 |
|`train_nproc_per_node` | 2 |
|`train_nnodes` | 2 |
|`train_num_epochs_phase_1` | 7 |
|`train_num_epochs_phase_2` | 10 |
|`train_effective_batch_size_phase_1` | 128 |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | 2e-05 |
|`train_learning_rate_phase_2` | 6e-06 |
|`train_num_warmup_steps_phase_1` | 1000 |
|`train_num_warmup_steps_phase_2` | 1000 |
|`train_save_samples` | 250000 |
|`train_max_batch_len` | 5000 |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Note that this will take a very long time, on the scale of double-digit hours of runtime.

##### Suggested Parameters: Development

Running the ilab pipeline at full capability takes a very long time and consumes a significant amount of resources. To create an end-to-end run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest the following values instead:

| Parameter | Suggested Value |
|---------- | ---------- |
|`sdg_repo_url` | https://github.com/instructlab/taxonomy.git |
|`sdg_repo_branch` | "" |
|`sdg_repo_pr` | 0 |
|`sdg_base_model` | s3://<BUCKET>/<PATH_TO_MODEL> |
|`sdg_scale_factor` | 30 |
|`sdg_pipeline` | "simple" |
|`sdg_max_batch_len` | 5000 |
|`sdg_sample_size` | **0.0002** |
|`train_nproc_per_node` | **1** |
|`train_nnodes` | **1** |
|`train_num_epochs_phase_1` | **2** |
|`train_num_epochs_phase_2` | **2** |
|`train_effective_batch_size_phase_1` | **3840** |
|`train_effective_batch_size_phase_2` | 3840 |
|`train_learning_rate_phase_1` | **1e-04** |
|`train_learning_rate_phase_2` | **1e-04** |
|`train_num_warmup_steps_phase_1` | **800** |
|`train_num_warmup_steps_phase_2` | **800** |
|`train_save_samples` | **0** |
|`train_max_batch_len` | **20000** |
|`train_seed` | 42 |
|`mt_bench_max_workers` | "auto" |
|`mt_bench_merge_system_user_message` | False |
|`final_eval_max_workers` | "auto" |
|`final_eval_few_shots` | 5 |
|`final_eval_batch_size` | "auto" |
|`final_eval_merge_system_user_message` | False |
|`k8s_storage_class_name` | standard |

Using these parameters will allow a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes.
Additionally, we can point the `judge-server` and `teacher-server` to the same Mistral model, which only uses 1 GPU, and the training configuration
specified here uses a single training node with 1 GPU, so a total of 2 GPUs is required, rather than the 8-9 GPUs required for the full pipeline.
With that said, the output model quality is likely very poor, and these values should only be used for testing purposes.

[RBAC configuration]: https://github.com/opendatahub-io/ilab-on-ocp/tree/main/standalone#rbac-requirements-when-running-in-a-kubernetes-job
