Add guide to overwrite dataset path

mlcommons · Jan 10, 2025 · 5fa2abb · 5fa2abb
1 parent 70fed93
commit 5fa2abb
Showing 1 changed file with 15 additions and 0 deletions.
diff --git a/mixture_of_experts_pretraining/README.md b/mixture_of_experts_pretraining/README.md
@@ -210,6 +210,13 @@ python ~/xpk/xpk.py workload create \
 --num-slices=<num_slices> \
 --command="bash script.sh"
 ```
+Note that the dataset paths default to the following values in [`dataset/c4_mlperf.yaml`](config/dataset/c4_mlperf.yaml):
+```
+train_dataset_path: gs://mlperf-llm-public2/c4/en_json/3.0.1
+eval_dataset_path: gs://mlperf-llm-public2/c4/en_val_subset_json
+```
+To override these defaults, append the following to the workload command:
+`dataset.train_dataset_path=/path/to/train/dir dataset.eval_dataset_path=/path/to/eval/dir`. Each path should work as either a local directory or a GCS bucket.
 
 ## Run Experiments in GCE
 
@@ -326,6 +333,14 @@ EOF
 "
 ```
 
+Note that the dataset paths default to the following values in [`dataset/c4_mlperf.yaml`](config/dataset/c4_mlperf.yaml):
+```
+train_dataset_path: gs://mlperf-llm-public2/c4/en_json/3.0.1
+eval_dataset_path: gs://mlperf-llm-public2/c4/en_val_subset_json
+```
+To override these defaults, append the following to the workload command:
+`dataset.train_dataset_path=/path/to/train/dir dataset.eval_dataset_path=/path/to/eval/dir`. Each path should work as either a local directory or a GCS bucket.
+
 #### Logging
 The workload starts only after all worker SSH connections are established. Once it has started, it is safe and recommended to exit the SSH sessions manually.
 The provided scripts may exceed the SSH connection timeout if you do not exit manually, causing unexpected command retries. These retries may produce error messages stating that the command failed because the TPU devices are currently in use. However, this should not disrupt your existing workload.
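
For reference, the dataset path override added in both hunks above could be assembled in the shell before it is appended to the workload command. This is a minimal sketch: the variable names and the placeholder paths are assumptions for illustration, and only the two `dataset.*` keys come from the README change itself.

```shell
#!/usr/bin/env bash
# Hypothetical locations (assumption): replace each with a local directory
# or a gs:// bucket path that holds your copy of the dataset.
TRAIN_DATASET_PATH="/path/to/train/dir"
EVAL_DATASET_PATH="/path/to/eval/dir"

# Build the two key=value overrides named in the README change.
DATASET_OVERRIDES="dataset.train_dataset_path=${TRAIN_DATASET_PATH} dataset.eval_dataset_path=${EVAL_DATASET_PATH}"

# Print the string so it can be appended to the training command
# in the workload script.
echo "${DATASET_OVERRIDES}"
```

Passing the paths as command-line overrides, rather than editing `config/dataset/c4_mlperf.yaml`, should leave the public defaults in place for other runs.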