Revamp Standalone
This commit does multiple things:

* Move all standalone-related work into the 'standalone' directory
* Add documentation on how to use the script, see README.md
* Add a new "show" command to display an example Kubernetes Job showing how to
  run the script as part of a Kubernetes Job. Run it like so:
  `./standalone.py show`
* Rework the SDG fetch to allow for push - this will be revisited once
  the final evaluation is implemented
* Hide all the SDG-related commands for running SDG with the script; now
  the script only fetches SDG data
* Add Makefile to generate the standalone script and format it with Ruff

Signed-off-by: Sébastien Han <[email protected]>
leseb committed Oct 8, 2024
1 parent e144ac2 commit 118acd5
Showing 6 changed files with 581 additions and 46 deletions.
5 changes: 5 additions & 0 deletions Makefile
@@ -0,0 +1,5 @@
.PHONY: standalone

standalone:
python3 pipeline.py gen-standalone
ruff format standalone/standalone.py
10 changes: 10 additions & 0 deletions README.md
@@ -42,3 +42,13 @@ To deploy a signed certificate in cluster follow [trusted cluster cert](signed-c
This solution requires object storage to be in place either through S3 or using Noobaa.

If you are using Noobaa, apply the following [tuning parameters](noobaa/README.md)

## Standalone Deployment

See [standalone](standalone/README.md) for instructions on deploying the InstructLab solution
without the need for RHOAI.
To generate the `standalone.py` script, run the following command (the [ruff](https://docs.astral.sh/ruff/installation/) tool must be installed):

```bash
make standalone
```
18 changes: 10 additions & 8 deletions pipeline.py
@@ -563,19 +563,20 @@ def gen_standalone():

     # Open the template file
     try:
-        with open(
-            STANDALONE_TEMPLATE_FILE_NAME, "r", encoding="utf-8"
-        ) as template_file:
+        standalone_template_path = path.join(
+            "standalone", STANDALONE_TEMPLATE_FILE_NAME
+        )
+        with open(standalone_template_path, "r", encoding="utf-8") as template_file:
             template_content = template_file.read()
     except FileNotFoundError as e:
         click.echo(
-            f"Error: The template file '{STANDALONE_TEMPLATE_FILE_NAME}' was not found.",
+            f"Error: The template file '{standalone_template_path}' was not found.",
             err=True,
         )
         raise click.exceptions.Exit(1) from e
     except IOError as e:
         click.echo(
-            f"Error: An I/O error occurred while reading '{STANDALONE_TEMPLATE_FILE_NAME}': {e}",
+            f"Error: An I/O error occurred while reading '{standalone_template_path}': {e}",
             err=True,
         )
         raise click.exceptions.Exit(1)
@@ -585,7 +586,7 @@ def gen_standalone():
         template = Template(template_content)
     except TemplateSyntaxError as e:
         click.echo(
-            f"Error: The template file '{STANDALONE_TEMPLATE_FILE_NAME}' contains a syntax error: {e}",
+            f"Error: The template file '{standalone_template_path}' contains a syntax error: {e}",
             err=True,
         )
         raise click.exceptions.Exit(1)
@@ -594,10 +595,11 @@ def gen_standalone():
     rendered_code = template.render(details)

     # Write the rendered code to a new Python file
-    with open(GENERATED_STANDALONE_FILE_NAME, "w", encoding="utf-8") as output_file:
+    standalone_script_path = path.join("standalone", GENERATED_STANDALONE_FILE_NAME)
+    with open(standalone_script_path, "w", encoding="utf-8") as output_file:
         output_file.write(rendered_code)

-    click.echo(f"Successfully generated '{GENERATED_STANDALONE_FILE_NAME}' script.")
+    click.echo(f"Successfully generated '{standalone_script_path}' script.")


 def get_executor_details(
186 changes: 186 additions & 0 deletions standalone/README.md
@@ -0,0 +1,186 @@
# Standalone Tool Documentation

## Overview

The `standalone.py` script simulates the [InstructLab](https://instructlab.ai/) workflow within a [Kubernetes](https://kubernetes.io/) environment,
replicating the functionality of a [KubeFlow Pipeline](https://github.com/kubeflow/pipelines). This allows for distributed training and evaluation of models
without relying on centralized orchestration tools like KubeFlow.

The `standalone.py` tool provides support for fetching generated SDG (Synthetic Data Generation) data from an object store that complies with the AWS S3 specification. While AWS S3 is supported, alternative object storage solutions such as Ceph, Noobaa, and MinIO are also compatible.

## Requirements

Since the `standalone.py` script is designed to run within a Kubernetes environment, the following
requirements must be met:
* A Kubernetes cluster with the necessary resources to run the InstructLab workflow.
* Nodes with GPUs available for training.
* [KubeFlow training operator](https://github.com/kubeflow/training-operator) must be running and `PyTorchJob` CRD must be installed.
* A [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) that supports dynamic provisioning with [ReadWriteMany](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) access mode.
* A Kubernetes configuration that allows access to the Kubernetes cluster.
* Both cluster and in-cluster configurations are supported.
* SDG data generated and uploaded to an object store.
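
Before launching the script, a quick sanity check of the training-operator and storage requirements above can save a failed run. A minimal sketch using plain `kubectl` (names and output will vary by cluster):

```bash
# Confirm the KubeFlow training operator's PyTorchJob CRD is installed
kubectl get crd pytorchjobs.kubeflow.org

# List StorageClasses and pick one that supports ReadWriteMany dynamic provisioning
kubectl get storageclass
```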

> [!NOTE]
> The script can be run outside of a Kubernetes environment or within a Kubernetes Job, but it requires a Kubernetes configuration file to access the cluster.

> [!TIP]
> Check the `show` command to display an example of a Kubernetes Job that runs the script. Use `./standalone.py show` to see the example.

## Features

* Run any part of the InstructLab workflow independently in a standalone environment, or run the full end-to-end workflow:
  * Fetch SDG data from an object store.
  * Train the model.
  * Evaluate the model.
  * Final model evaluation. (Not implemented yet)
  * Push the final model back to the object store, to the same location as the SDG data. (Not implemented yet)

The `standalone.py` script includes a main command to execute the full workflow, along with
subcommands to run individual parts of the workflow separately. To view all available commands, use
`./standalone.py --help`. The following command runs the end-to-end workflow: it retrieves the SDG
data from the object store and sets up the necessary resources for running the training and
evaluation steps:

```bash
./standalone.py run --namespace my-namespace --sdg-object-store-secret sdg-data
```

If you only want to fetch the SDG data, use the `sdg-data-fetch` subcommand:

```bash
./standalone.py run --namespace my-namespace --sdg-object-store-secret sdg-data sdg-data-fetch
```

Other subcommands are available to run the training and evaluation steps:

```bash
./standalone.py run --namespace my-namespace train
./standalone.py run --namespace my-namespace evaluation
```

* Fetch SDG data from an object store.
* Support for AWS S3 and compatible object storage solutions.
* Configuration via CLI options, environment variables, or Kubernetes secrets.

> [!CAUTION]
> All the CLI options MUST be positioned after the `run` command and BEFORE any subcommands.

## Usage

The script requires information regarding the location and method for accessing the SDG data. This information can be provided in two main ways:

1. CLI options and/or environment variables: supply all necessary information via CLI options or environment variables.
2. Kubernetes Secret: Provide the name of a Kubernetes secret that contains all relevant details using the `--sdg-object-store-secret` option.

### CLI Options

* `--namespace`: The namespace in which the Kubernetes resources are located - **Required**
* `--storage-class`: The storage class to use for the PVCs - **Optional** - Default: cluster default storage class.
* `--nproc-per-node`: The number of processes to run per node - **Optional** - Default: 1.
* `--sdg-object-store-secret`: The name of the Kubernetes secret containing the SDG object store credentials.
* `--sdg-object-store-endpoint`: The endpoint of the object store. `SDG_OBJECT_STORE_ENDPOINT` environment variable can be used as well.
* `--sdg-object-store-bucket`: The bucket name in the object store. `SDG_OBJECT_STORE_BUCKET` environment variable can be used as well.
* `--sdg-object-store-access-key`: The access key for the object store. `SDG_OBJECT_STORE_ACCESS_KEY` environment variable can be used as well.
* `--sdg-object-store-secret-key`: The secret key for the object store. `SDG_OBJECT_STORE_SECRET_KEY` environment variable can be used as well.
* `--sdg-object-store-data-key`: The key for the SDG data in the object store. e.g., `sdg.tar.gz`. `SDG_OBJECT_STORE_DATA_KEY` environment variable can be used as well.
* `--sdg-object-store-verify-tls`: Whether to verify TLS for the object store endpoint (default:
true). `SDG_OBJECT_STORE_VERIFY_TLS` environment variable can be used as well.
* `--sdg-object-store-region`: The region of the object store. `SDG_OBJECT_STORE_REGION` environment variable can be used as well.
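
As noted above, every `--sdg-object-store-*` flag has an environment-variable equivalent. A minimal sketch, assuming the script reads these variables in place of the corresponding flags (bucket and data key taken from the example below, credentials are placeholders):

```bash
export SDG_OBJECT_STORE_BUCKET=sdg-data
export SDG_OBJECT_STORE_ACCESS_KEY='*****'   # placeholder
export SDG_OBJECT_STORE_SECRET_KEY='*****'   # placeholder
export SDG_OBJECT_STORE_DATA_KEY=sdg.tar.gz

./standalone.py run --namespace my-namespace
```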

## Example End-To-End Workflow

### Generating and Uploading SDG Data

The following example demonstrates how to generate SDG data, package it as a tarball, and upload it
to an object store. This assumes that the AWS CLI is installed and configured with the necessary
credentials.
In this scenario, the bucket is named `sdg-data` and the tarball file is `sdg.tar.gz`.

```bash
ilab data generate
cd generated
tar -czvf sdg.tar.gz *
aws s3 cp sdg.tar.gz s3://sdg-data/sdg.tar.gz
```
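
To confirm the tarball actually landed in the bucket before moving on, you can list the bucket with the same AWS CLI credentials:

```bash
aws s3 ls s3://sdg-data/
```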

> [!CAUTION]
> Ensure the SDG data is packaged as a tarball **without** top-level directories, so you must run `tar` inside the directory containing the SDG data.

### Creating the Kubernetes Secret

The simplest method to supply the script with the required information for retrieving SDG data is by
creating a Kubernetes secret. In the example below, we create a secret called `sdg-data` within the
`my-namespace` namespace, containing the necessary credentials. Ensure that you update the access
key and secret key as needed. The `data_key` field refers to the name of the tarball file in the
object store that holds the SDG data. In this case, it's named `sdg.tar.gz`, as we previously
uploaded the tarball to the object store using this name.

```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Secret
metadata:
  name: sdg-data
  namespace: my-namespace
type: Opaque
stringData:
  bucket: sdg-data
  access_key: *****
  secret_key: *****
  data_key: sdg.tar.gz
EOF

./standalone.py run --namespace my-namespace --sdg-object-store-secret sdg-data
```
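
Equivalently, the same secret can be created with `kubectl create secret generic` instead of a manifest, a sketch using the placeholder values above:

```bash
kubectl create secret generic sdg-data \
  --namespace my-namespace \
  --from-literal=bucket=sdg-data \
  --from-literal=access_key='*****' \
  --from-literal=secret_key='*****' \
  --from-literal=data_key=sdg.tar.gz
```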

> [!WARNING]
> The secret must be part of the same namespace as the resources that the script interacts with.
> It is inherited from the `--namespace` option.

The list of all supported keys:

* `bucket`: The bucket name in the object store - **Required**
* `access_key`: The access key for the object store - **Required**
* `secret_key`: The secret key for the object store - **Required**
* `data_key`: The key for the SDG data in the object store - **Required**
* `verify_tls`: Whether to verify TLS for the object store endpoint (default: true) - **Optional**
* `endpoint`: The endpoint of the object store, e.g., `https://s3.openshift-storage.svc:443` - **Optional**
* `region`: The region of the object store - **Optional**
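
For an S3-compatible store, the optional keys slot into the same secret. A sketch reusing the endpoint shown above; the `region` and `verify_tls` values are placeholders to adapt to your environment:

```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Secret
metadata:
  name: sdg-data
  namespace: my-namespace
type: Opaque
stringData:
  bucket: sdg-data
  access_key: *****
  secret_key: *****
  data_key: sdg.tar.gz
  endpoint: https://s3.openshift-storage.svc:443
  region: us-east-1
  verify_tls: "false"
EOF
```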

#### Running the Script Without Kubernetes Secret

Alternatively, you can provide the necessary information directly via CLI options or environment variables;
the script will then use it to fetch the SDG data and create its own Kubernetes Secret named
`sdg-object-store-credentials` in the same namespace as the resources it interacts with (in this case, `my-namespace`).

```bash
./standalone.py run \
--namespace my-namespace \
--sdg-object-store-access-key key \
--sdg-object-store-secret-key key \
--sdg-object-store-bucket sdg-data \
--sdg-object-store-data-key sdg.tar.gz
```
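
If the run succeeds, the script should have created the `sdg-object-store-credentials` secret mentioned above; a quick way to confirm:

```bash
kubectl -n my-namespace get secret sdg-object-store-credentials
```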

#### Advanced Configuration Using an S3-Compatible Object Store

If you don't use the official AWS S3 endpoint, you can provide additional information about the object store:

```bash
./standalone.py run \
--namespace foo \
--sdg-object-store-access-key key \
--sdg-object-store-secret-key key \
--sdg-object-store-bucket sdg-data \
--sdg-object-store-data-key sdg.tar.gz \
--sdg-object-store-verify-tls false \
--sdg-object-store-endpoint https://s3.openshift-storage.svc:443
```
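
If the fetch fails, it can help to rule out connectivity or TLS issues by probing the same endpoint directly with the AWS CLI, a sketch assuming the bucket from the example:

```bash
aws s3 ls s3://sdg-data/ \
  --endpoint-url https://s3.openshift-storage.svc:443 \
  --no-verify-ssl
```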

> [!IMPORTANT]
> The `--sdg-object-store-endpoint` option must be provided in the format
> `scheme://host:<port>`; the port can be omitted if it is the default port.

> [!TIP]
> If you don't want to run the entire workflow and only want to fetch the SDG data, you can use the `run sdg-data-fetch` command.