Commit 1d48bfb

Merge pull request #108 from leseb/standone-on-job
bulk: final :)
2 parents f110f35 + bc43b58 commit 1d48bfb

4 files changed: +506 -364 lines changed

pipeline.py

+1 -1
@@ -447,7 +447,7 @@ def gen_standalone():
         "exec-git-clone-op": {},
         "exec-huggingface-importer-op": 'huggingface_importer_op(repo_name="{REPO_GRANITE_7B_IMAGE}", model="{DATA_PVC_MODEL_PATH}")',
         "exec-run-mt-bench-op": 'run_mt_bench_op(best_score_file="{MT_BENCH_SCORES_PATH}",mt_bench_output="{MT_BENCH_OUTPUT_PATH}",models_folder="{CANDIDATE_MODEL_PATH_PREFIX}",models_path_prefix="{CANDIDATE_MODEL_PATH_PREFIX}", max_workers="{MAX_WORKERS}", merge_system_user_message={MERGE_SYSTEM_USER_MESSAGE})',
-        "exec-run-final-eval-op": 'run_final_eval_op(mmlu_branch_output="{MMLU_BRANCH_SCORES_PATH}", mt_bench_branch_output="{MT_BENCH_OUTPUT_PATH}", candidate_model="{CANDIDATE_MODEL_PATH}", taxonomy="{TAXONOMY_PATH}", tasks="{DATA_PVC_SDG_PATH}", base_branch="", candidate_branch="", device=None, base_model_dir="{DATA_PVC_MODEL_PATH}", max_workers="{MAX_WORKERS}", merge_system_user_message={MERGE_SYSTEM_USER_MESSAGE}, model_dtype="{MODEL_DTYPE}", few_shots={FEW_SHOTS}, batch_size={BATCH_SIZE})',
+        "exec-run-final-eval-op": 'run_final_eval_op(mmlu_branch_output="{MMLU_BRANCH_SCORES_PATH}", mt_bench_branch_output="{MT_BENCH_BRANCH_SCORES_PATH}", candidate_model="{CANDIDATE_MODEL_PATH}", taxonomy="{TAXONOMY_PATH}", tasks="{DATA_PVC_SDG_PATH}", base_branch="", candidate_branch="", device=None, base_model_dir="{DATA_PVC_MODEL_PATH}", max_workers="{MAX_WORKERS}", merge_system_user_message={MERGE_SYSTEM_USER_MESSAGE}, model_dtype="{MODEL_DTYPE}", few_shots={FEW_SHOTS}, batch_size={BATCH_SIZE})',
     }

     details = {}
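
The only change in this hunk is the placeholder passed as `mt_bench_branch_output`. A minimal sketch of the effect, assuming the template strings are later filled in with `str.format` (the substitution step is not shown in this diff) and using made-up paths in place of the real constants:

```python
# Hypothetical values; the real constants are defined elsewhere in pipeline.py.
paths = {
    "MT_BENCH_OUTPUT_PATH": "/data/mt-bench-results.txt",
    "MT_BENCH_BRANCH_SCORES_PATH": "/data/mt-bench-branch-best.txt",
}

# Trimmed-down versions of the template before and after the commit.
before = 'run_final_eval_op(mt_bench_branch_output="{MT_BENCH_OUTPUT_PATH}")'
after = 'run_final_eval_op(mt_bench_branch_output="{MT_BENCH_BRANCH_SCORES_PATH}")'

# str.format ignores unused keyword arguments, so one mapping serves both.
rendered_before = before.format(**paths)
rendered_after = after.format(**paths)

print(rendered_before)
print(rendered_after)
```

Under that assumption, the old template would have pointed the final evaluation's branch scores at the same file as the MT-Bench output, while the new one gives them a separate path.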

standalone/README.md

+144 -2
@@ -26,6 +26,144 @@ The `standalone.py` script is designed to run within a Kubernetes environment. T
 > [!TIP]
 > Check the `show` command to display an example of a Kubernetes Job that runs the script. Run `./standalone.py show`.

+### RBAC Requirements when running in a Kubernetes Job
+
+The script manipulates a number of Kubernetes resources, and therefore requires the following RBAC
+permissions on the [ServiceAccount](https://kubernetes.io/docs/concepts/security/service-accounts/)
+running the script:
+
+```yaml
+# logs
+- verbs:
+    - get
+    - list
+  apiGroups:
+    - ""
+  resources:
+    - pods/log
+# Jobs
+- verbs:
+    - create
+    - get
+    - list
+    - watch
+  apiGroups:
+    - batch
+  resources:
+    - jobs
+# Pods
+- verbs:
+    - create
+    - get
+    - list
+    - watch
+  apiGroups:
+    - ""
+  resources:
+    - pods
+# Secrets
+- verbs:
+    - create
+    - get
+  apiGroups:
+    - ""
+  resources:
+    - secrets
+# ConfigMaps
+- verbs:
+    - create
+    - get
+  apiGroups:
+    - ""
+  resources:
+    - configmaps
+# PVCs
+- verbs:
+    - create
+  apiGroups:
+    - ""
+  resources:
+    - persistentvolumeclaims
+# PyTorchJob
+- verbs:
+    - create
+    - get
+    - list
+    - watch
+  apiGroups:
+    - kubeflow.org
+  resources:
+    - pytorchjobs
+# Watchers
+- verbs:
+    - get
+    - list
+    - watch
+  apiGroups:
+    - ""
+  resources:
+    - events
+```
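
The list above contains only the rules; to take effect they have to sit inside a Role that is bound to the ServiceAccount. A minimal sketch of that wrapping, assuming a namespaced Role is sufficient. The name `standalone-runner` is made up for illustration, and the namespace `leseb` and ServiceAccount `default` are borrowed from the Job example in this README:

```python
import json

RBAC_API = "rbac.authorization.k8s.io"


def rule(api_group, resource, verbs):
    """Build one PolicyRule for a single resource."""
    return {"apiGroups": [api_group], "resources": [resource], "verbs": list(verbs)}


# Same rules as the YAML list above, in the same order.
RULES = [
    rule("", "pods/log", ["get", "list"]),
    rule("batch", "jobs", ["create", "get", "list", "watch"]),
    rule("", "pods", ["create", "get", "list", "watch"]),
    rule("", "secrets", ["create", "get"]),
    rule("", "configmaps", ["create", "get"]),
    rule("", "persistentvolumeclaims", ["create"]),
    rule("kubeflow.org", "pytorchjobs", ["create", "get", "list", "watch"]),
    rule("", "events", ["get", "list", "watch"]),
]


def build_rbac(name, namespace, service_account):
    """Return a v1 List holding a Role and the RoleBinding that grants it."""
    metadata = {"name": name, "namespace": namespace}
    role = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "Role",
        "metadata": metadata,
        "rules": RULES,
    }
    binding = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "RoleBinding",
        "metadata": metadata,
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {"apiGroup": RBAC_API, "kind": "Role", "name": name},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


if __name__ == "__main__":
    # "standalone-runner" is a hypothetical name; adjust all three values to your cluster.
    print(json.dumps(build_rbac("standalone-runner", "leseb", "default"), indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -`.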
+
+### Run in a Kubernetes Job
+
+The script can be run in a Kubernetes Job by creating a Job resource that runs the script. The
+`show` subcommand displays an example of a Kubernetes Job that runs the script:
+
+```bash
+./standalone/standalone.py show \
+  --image quay.io/opendatahub/workbench-images:jupyter-datascience-ubi9-python-3.11-20241004-609ffb8 \
+  --script-configmap standalone \
+  --script-name script \
+  --namespace leseb \
+  --args "--storage-class=nfs-csi" \
+  --args "--namespace=leseb" \
+  --args "--sdg-object-store-secret=sdg-object-store-credentials" \
+  --args "--judge-serving-model-secret=judge-serving-details"
+
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: distributed-ilab
+  namespace: leseb
+spec:
+  template:
+    spec:
+      containers:
+      - args:
+        - --storage-class=nfs-csi
+        - --namespace=leseb
+        - --sdg-object-store-secret=sdg-object-store-credentials
+        - --judge-serving-model-secret=judge-serving-details
+        command:
+        - python3
+        - /config/script
+        - run
+        image: quay.io/opendatahub/workbench-images:jupyter-datascience-ubi9-python-3.11-20241004-609ffb8
+        name: distributed-ilab
+        volumeMounts:
+        - mountPath: /config
+          name: script-config
+      restartPolicy: Never
+      serviceAccountName: default
+      volumes:
+      - configMap:
+          name: standalone
+        name: script-config
+```
+
+Optional arguments can be added to the `args` list to customize the script's behavior. They
+represent the script options that would be passed to the script if run from the command line.
+
+List of available options of the `show` subcommand:
+
+* `--namespace`: Kubernetes namespace to run the job
+* `--name`: Name of the job
+* `--image`: The image to use for the job
+* `--script-configmap`: The name of the ConfigMap that holds the script
+* `--script-name`: The name of the script in the ConfigMap
+* `--args`: Additional arguments to pass to the script - can be passed multiple times
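
To make the mapping from the `show` options to the generated manifest concrete, here is a minimal sketch. `build_job` is a hypothetical helper written for illustration, not the script's actual implementation; its field layout simply mirrors the example Job above, including the hard-coded `default` ServiceAccount.

```python
import json


def build_job(namespace, name, image, script_configmap, script_name, args):
    """Assemble a Job manifest shaped like the example output of `show`."""
    container = {
        "name": name,
        "image": image,
        # The ConfigMap is mounted at /config, so the script is found there.
        "command": ["python3", f"/config/{script_name}", "run"],
        # Each repeated --args value becomes one entry, in the order given.
        "args": list(args),
        "volumeMounts": [{"mountPath": "/config", "name": "script-config"}],
    }
    pod_spec = {
        "containers": [container],
        "restartPolicy": "Never",
        "serviceAccountName": "default",
        "volumes": [{"name": "script-config", "configMap": {"name": script_configmap}}],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": pod_spec}},
    }


if __name__ == "__main__":
    job = build_job(
        namespace="leseb",
        name="distributed-ilab",
        image="quay.io/opendatahub/workbench-images:jupyter-datascience-ubi9-python-3.11-20241004-609ffb8",
        script_configmap="standalone",
        script_name="script",
        args=["--storage-class=nfs-csi", "--namespace=leseb"],
    )
    print(json.dumps(job, indent=2))
```

Note that `--namespace` appears twice in the README's example for different purposes: once as a `show` option that sets the Job's own namespace, and once inside `--args`, where it is forwarded to the script.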
+
 ## Features

 * Run any part of the InstructLab workflow in a standalone environment independently or a full end-to-end workflow:
@@ -36,7 +174,9 @@ The `standalone.py` script is designed to run within a Kubernetes environment. T
 * Evaluate model by running MT_Bench with `evaluation` subcommand along with `--eval-type mt-bench` option.
 * Final model evaluation with `evaluation` subcommand along with `--eval-type final` option.
 * Final evaluation runs both MT Bench_Branch and MMLU_Branch
-* Push the final model back to the object store - same location as the SDG data with `upload-trained-model` subcommand.
+* Push the final model back to the object store - same location as the SDG data with
+`upload-trained-model` subcommand.
+* Dry-run mode to print the generated Kubernetes resources without executing - `--dry-run` option.

 > [!NOTE]
 > Read about InstructLab model evaluation in the [instructlab/eval repository](https://github.com/instructlab/eval/blob/main/README.md).
@@ -124,7 +264,9 @@ evaluation
 * `--training-1-epoch-num`: The number of epochs to train the model for phase 1. **Optional** - Default: 7.
 * `--training-2-epoch-num`: The number of epochs to train the model for phase 2. **Optional** -
 Default: 10.
-* `--eval-type`: The evaluation type to use. **Optional** - Default: `mt-bench`. Available options: `mt-bench`, `final`.
+* `--eval-type`: The evaluation type to use. **Optional** - Default: `mt-bench`. Available options:
+`mt-bench`, `final`.
+* `--dry-run`: Print the generated Kubernetes resources without executing them. **Optional** - Default: false.


 ## Example Workflow with Synthetic Data Generation (SDG)
