Update docs for cloudpathlib and Python 3.12 (#60)
Update docs to reflect switch to `cloudpathlib` and add warnings related
to Python 3.12.
adrianeboyd authored Sep 13, 2023
1 parent 592fcaf commit 1366220
Showing 2 changed files with 78 additions and 63 deletions.
73 changes: 42 additions & 31 deletions docs/cli.md
@@ -1,8 +1,7 @@
# Command Line Interface

The `weasel` CLI includes subcommands for working with
Weasel projects, end-to-end workflows for building and
deploying custom pipelines.
The `weasel` CLI includes subcommands for working with Weasel projects,
end-to-end workflows for building and deploying custom pipelines.

## :clipboard: clone

@@ -36,13 +35,13 @@ python -m weasel clone [name] [dest] [--repo] [--branch] [--sparse]
## :open_file_folder: assets
Fetch project assets like datasets and pretrained weights. Assets are defined in
the `assets` section of the [`project.yml`](tutorial/directory-and-assets.md#project-yml). If a
`checksum` is provided, the file is only downloaded if no local file with the
same checksum exists and Weasel will show an error if the checksum of the
downloaded file doesn't match. If assets don't specify a `url` they're
considered "private" and you have to take care of putting them into the
destination directory yourself. If a local path is provided, the asset is copied
into the current project.
the `assets` section of the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). If a `checksum`
is provided, the file is only downloaded if no local file with the same checksum
exists and Weasel will show an error if the checksum of the downloaded file
doesn't match. If assets don't specify a `url` they're considered "private" and
you have to take care of putting them into the destination directory yourself.
If a local path is provided, the asset is copied into the current project.
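
The checksum behaviour described above can be sketched roughly as follows. This is a minimal illustration and not Weasel's actual implementation: the helper names `file_md5`, `needs_download` and `verify_download` are hypothetical, and MD5 is assumed as the checksum algorithm.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(dest: Path, checksum: str) -> bool:
    """Hypothetical helper: skip the download only if a local file
    with the same checksum already exists."""
    return not (dest.is_file() and file_md5(dest) == checksum)


def verify_download(dest: Path, checksum: str) -> None:
    """Hypothetical helper: raise if the downloaded file does not match."""
    actual = file_md5(dest)
    if actual != checksum:
        raise ValueError(f"Checksum mismatch for {dest}: {actual} != {checksum}")
```

Reading in chunks keeps memory use flat for large assets such as pretrained weights.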
```bash
python -m weasel assets [project_dir]
@@ -59,11 +58,12 @@ python -m weasel assets [project_dir]
## :rocket: run

Run a named command or workflow defined in the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). If a workflow name is specified,
all commands in the workflow are run, in order. If commands define
[dependencies or outputs](tutorial/directory-and-assets.md#dependencies-and-outputs), they will only be
re-run if state has changed. For example, if the input dataset changes, a
preprocessing command that depends on those files will be re-run.
[`project.yml`](tutorial/directory-and-assets.md#project-yml). If a workflow
name is specified, all commands in the workflow are run, in order. If commands
define
[dependencies or outputs](tutorial/directory-and-assets.md#dependencies-and-outputs),
they will only be re-run if state has changed. For example, if the input dataset
changes, a preprocessing command that depends on those files will be re-run.
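
One way to picture the "only re-run if state has changed" rule is to hash a command together with its dependencies and compare the result with a previously recorded value. The sketch below is an assumption-laden illustration, not Weasel's real logic: `state_hash`, `should_rerun` and the `lock` dictionary are hypothetical names, MD5 is an assumed hash, and outputs are left out for brevity.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def state_hash(script: List[str], deps: List[Path]) -> str:
    """Hash the command strings together with the contents of each dependency."""
    digest = hashlib.md5()
    for line in script:
        digest.update(line.encode("utf8"))
    for dep in sorted(deps):
        digest.update(str(dep).encode("utf8"))
        # Missing or non-file dependencies contribute no content bytes.
        digest.update(dep.read_bytes() if dep.is_file() else b"")
    return digest.hexdigest()


def should_rerun(name: str, script: List[str], deps: List[Path],
                 lock: Dict[str, str]) -> bool:
    """Re-run when nothing was recorded yet or the recorded state differs.

    `lock` is a stand-in for whatever state is kept between runs.
    """
    return lock.get(name) != state_hash(script, deps)
```

With this shape, editing the input dataset changes the hash, so a preprocessing command that lists the dataset as a dependency would be run again.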

```bash
python -m weasel run [subcommand] [project_dir] [--force] [--dry]
@@ -80,6 +80,11 @@ python -m weasel run [subcommand] [project_dir] [--force] [--dry]

## :arrow_up: push

> :warning: **Important note on Python 3.12**
>
> As of `cloudpathlib` v0.15.1, Python 3.12 is not yet supported. For remote
> storage, please use Python 3.11 or earlier.
Upload all available files or directories listed in the `outputs` section of
commands to a remote storage. Outputs are archived and compressed prior to
upload, and addressed in the remote storage using the output's relative path
@@ -90,10 +95,10 @@ If the contents are different, the new version of the file is uploaded. Deleting
obsolete files is left up to you.

Remotes can be defined in the `remotes` section of the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). Under the hood, Weasel uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[`project.yml`](tutorial/directory-and-assets.md#project-yml). Under the hood,
Weasel uses [`cloudpathlib`](https://cloudpathlib.drivendata.org) to communicate
with the remote storages, so you can use any protocol that `CloudPath` supports,
including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
@@ -122,6 +127,11 @@ python -m weasel push [remote] [project_dir]
## :arrow_down: pull
> :warning: **Important note on Python 3.12**
>
> As of `cloudpathlib` v0.15.1, Python 3.12 is not yet supported. For remote
> storage, please use Python 3.11 or earlier.
Download all files or directories listed as `outputs` for commands, unless they
are already present locally. When searching for files in the remote, `pull`
won't just look at the output path, but will also consider the **command
@@ -134,10 +144,10 @@ outputs, so if you change the config back, you'll be able to fetch back the
result.
Remotes can be defined in the `remotes` section of the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). Under the hood, Weasel uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[`project.yml`](tutorial/directory-and-assets.md#project-yml). Under the hood,
Weasel uses [`cloudpathlib`](https://cloudpathlib.drivendata.org/) to
communicate with the remote storages, so you can use any protocol that
`CloudPath` supports, including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
@@ -167,11 +177,12 @@ python -m weasel pull [remote] [project_dir]
## :closed_book: document
Auto-generate a pretty Markdown-formatted `README` for your project, based on
its [`project.yml`](tutorial/directory-and-assets.md#project-yml). Will create sections that
document the available commands, workflows and assets. The auto-generated
content will be placed between two hidden markers, so you can add your own
custom content before or after the auto-generated documentation. When you re-run
the `project document` command, only the auto-generated part is replaced.
its [`project.yml`](tutorial/directory-and-assets.md#project-yml). Will create
sections that document the available commands, workflows and assets. The
auto-generated content will be placed between two hidden markers, so you can add
your own custom content before or after the auto-generated documentation. When
you re-run the `project document` command, only the auto-generated part is
replaced.
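
The marker-based replacement described above could look roughly like this. It is a minimal sketch under stated assumptions: the marker strings `START` and `END` and the function `update_readme` are hypothetical, and the real markers used by Weasel may differ.

```python
# Hypothetical marker strings; the real hidden markers may differ.
START = "<!-- AUTO-DOCS: START -->"
END = "<!-- AUTO-DOCS: END -->"


def update_readme(existing: str, generated: str) -> str:
    """Replace only the text between the markers, keeping custom content."""
    block = f"{START}\n{generated}\n{END}"
    if START in existing and END in existing:
        before = existing.split(START, 1)[0]
        after = existing.split(END, 1)[1]
        return before + block + after
    # Assumption: without markers, the generated block is the whole document.
    return block
```

Anything written before `START` or after `END` survives a re-run untouched, which is what makes it safe to mix hand-written and generated sections in one `README`.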
```bash
python -m weasel document [project_dir] [--output] [--no-emoji]
@@ -199,9 +210,9 @@ Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls
[`dvc run`](https://dvc.org/doc/command-reference/run) with `--no-exec` under
the hood to generate the `dvc.yaml`. A DVC project can only define one pipeline,
so you need to specify one workflow defined in the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). If no workflow is specified, the
first defined workflow is used. The DVC config will only be updated if the
`project.yml` changed. For details, see the
[`project.yml`](tutorial/directory-and-assets.md#project-yml). If no workflow is
specified, the first defined workflow is used. The DVC config will only be
updated if the `project.yml` changed. For details, see the
[DVC integration](tutorial/integrations.md#data-version-control-dvc) docs.
> **Warning**
68 changes: 36 additions & 32 deletions docs/tutorial/remote-storage.md
@@ -1,18 +1,22 @@
# Remote Storage

> :warning: **Important note on Python 3.12**
>
> As of v0.15.1, `cloudpathlib` does not support Python 3.12. For remote
> storage, please use Python 3.11 or earlier.
You can persist your project outputs to a remote storage using the
[`push`](../cli.md#arrow_up-push) command. This can help you **export**
your pipeline packages, **share** work with your team, or **cache results** to
avoid repeating work. The [`pull`](../cli.md#arrow_down-pull) command will
download any outputs that are in the remote storage and aren't available
locally.
[`push`](../cli.md#arrow_up-push) command. This can help you **export** your
pipeline packages, **share** work with your team, or **cache results** to avoid
repeating work. The [`pull`](../cli.md#arrow_down-pull) command will download
any outputs that are in the remote storage and aren't available locally.

You can list one or more remotes in the `remotes` section of your
[`project.yml`](./directory-and-assets.md#projectyml) by mapping a string name to the URL of the
storage. Under the hood, Weasel uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[`project.yml`](./directory-and-assets.md#projectyml) by mapping a string name
to the URL of the storage. Under the hood, Weasel uses
[`cloudpathlib`](https://cloudpathlib.drivendata.org/) to communicate with the
remote storages, so you can use any protocol that `CloudPath` supports,
including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
@@ -31,12 +35,12 @@ protocols.
> :information_source: **How it works**
>
> Inside the remote storage, Weasel uses a clever **directory structure** to avoid
> overwriting files. The top level of the directory structure is a URL-encoded
> version of the output's path. Within this directory are subdirectories named
> according to a hash of the command string and the command's dependencies.
> Finally, within those directories are files, named according to an MD5 hash of
> their contents.
> Inside the remote storage, Weasel uses a clever **directory structure** to
> avoid overwriting files. The top level of the directory structure is a
> URL-encoded version of the output's path. Within this directory are
> subdirectories named according to a hash of the command string and the
> command's dependencies. Finally, within those directories are files, named
> according to an MD5 hash of their contents.
>
> ```
> └── urlencoded_file_path # Path of original file
@@ -47,7 +51,8 @@ protocols.
> └── third_content_hash
> ```
For instance, let's say you had the following spaCy command in your `project.yml`:
For instance, let's say you had the following spaCy command in your
`project.yml`:
```yaml title="project.yml"
- name: train
@@ -62,9 +67,9 @@ For instance, let's say you had the following spaCy command in your `project.yml
- 'training/model-best'
```
After you finish training, you run [`push`](../cli.md#arrow_up-push) to
make sure the `training/model-best` output is saved to remote storage. Weasel
will then construct a hash from your command script and the listed dependencies,
After you finish training, you run [`push`](../cli.md#arrow_up-push) to make
sure the `training/model-best` output is saved to remote storage. Weasel will
then construct a hash from your command script and the listed dependencies,
`corpus/train`, `corpus/dev` and `config.cfg`, in order to identify the
execution context of your output. It would then compute an MD5 hash of the
`training/model-best` directory, and use those three pieces of information to
@@ -75,24 +80,23 @@ python -m weasel run train
python -m weasel push
```
``` title="Overview of the S3 bucket"
```title="Overview of the S3 bucket"
└── s3://my-weasel-bucket/training%2Fmodel-best
└── 1d8cb33a06cc345ad3761c6050934a1b
└── d8e20c3537a084c5c10d95899fe0b1ff
```
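
The three-level layout shown above can be sketched in Python. This is an illustration only, not Weasel's actual code: `remote_path` is a hypothetical helper, MD5 is assumed for the creation hash, and the dependency hashes and content hash are taken as precomputed inputs.

```python
import hashlib
from typing import List
from urllib.parse import quote_plus


def remote_path(root: str, output: str, script: List[str],
                dep_hashes: List[str], content_hash: str) -> str:
    """Build <root>/<url-encoded output>/<creation hash>/<content hash>."""
    creation = hashlib.md5()
    # The creation hash identifies the execution context:
    # the command script plus the state of its dependencies.
    for part in script + dep_hashes:
        creation.update(part.encode("utf8"))
    return "/".join([
        root.rstrip("/"),
        quote_plus(output),  # "training/model-best" -> "training%2Fmodel-best"
        creation.hexdigest(),
        content_hash,
    ])
```

Because the creation hash changes whenever the command or a dependency changes, two pushes of the same output path from different contexts land in different subdirectories instead of overwriting each other.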
If you change the command or one of its dependencies (for instance, by editing
the [`config.cfg`](https://spacy.io/usage/training#config) file to tune the hyperparameters),
a different creation hash will be calculated, so when you use
[`push`](../cli.md#arrow_up-push) you won't be overwriting your previous
file. The system even supports multiple outputs for the same file and the same
the [`config.cfg`](https://spacy.io/usage/training#config) file to tune the
hyperparameters), a different creation hash will be calculated, so when you use
[`push`](../cli.md#arrow_up-push) you won't be overwriting your previous file.
The system even supports multiple outputs for the same file and the same
context, which can happen if your training process is not deterministic, or if
you have dependencies that aren't represented in the command.
In summary, the `weasel` remote storages are designed
to make a particular set of trade-offs. Priority is placed on **convenience**,
**correctness** and **avoiding data loss**. You can use
[`push`](../cli.md#arrow_up-push) freely, as you'll never overwrite remote
state, and you don't have to come up with names or version numbers. However,
it's up to you to manage the size of your remote storage, and to remove files
that are no longer relevant to you.
In summary, the `weasel` remote storages are designed to make a particular set
of trade-offs. Priority is placed on **convenience**, **correctness** and
**avoiding data loss**. You can use [`push`](../cli.md#arrow_up-push) freely, as
you'll never overwrite remote state, and you don't have to come up with names or
version numbers. However, it's up to you to manage the size of your remote
storage, and to remove files that are no longer relevant to you.
